Take-Home Exercise 3b: Predicting HDB Resale Prices with Geographically Weighted Machine Learning Methods

Author

Lorielle Malveda

Published

November 2, 2024

Modified

November 9, 2024

1. OVERVIEW

Housing is a key part of wealth for households around the world, with buying a home often being one of the biggest investments people make. Housing prices are influenced by many factors, including broader economic ones like inflation and local factors specific to the property, such as its size or location near services like schools and shopping centers. Traditionally, price prediction models used a method called Ordinary Least Squares (OLS), but this approach doesn’t account for the geographical patterns in housing data, which can lead to inaccurate results. To improve accuracy, Geographically Weighted Models (GWMs) were introduced to better predict housing prices by considering these location-based differences.

2. THE TASK

For this exercise, your task is to build a predictive model to forecast HDB resale prices for the period of July to September 2024, using transaction records from HDB resale data in 2023.

3. THE DATA

For this exercise, you’ll use the HDB Resale Flat Prices dataset from Data.gov.sg as the primary source. The analysis should focus on resale price predictions for either three-room, four-room, or five-room flats. Here are recommended predictors you can use, though you’re welcome to include other relevant variables.

Structural factors: - Area of the unit - Floor level - Remaining lease - Age of the unit - Main Upgrading Program (MUP) completion (optional)

Locational factors: - Proximity to the Central Business District (CBD) - Proximity to eldercare facilities - Proximity to food courts/hawker centers - Proximity to MRT stations - Proximity to parks - Proximity to good primary schools - Proximity to shopping malls - Proximity to supermarkets - Number of kindergartens within 350 meters - Number of childcare centers within 350 meters - Number of bus stops within 350 meters - Number of primary schools within 1 kilometer

These variables provide a foundation for modeling, helping capture both the physical characteristics of the flats and the surrounding amenities that may impact resale prices.

4. GEOSPATIAL DATA WRANGLING

4.1 INSTALLING THE R PACKAGES

This code chunk uses p_load() of the pacman package (stands for Package Manager) to check if the following packages are installed in the computer. The packages will then be launched into R.

pacman::p_load(tidyverse, sf, httr, jsonlite, rvest, tmap, rsample, GWmodel, SpatialML, ranger)

This code reads and processes a CSV file containing HDB resale transaction data. It filters the records to include only transactions between January 2023 and September 2024, focusing specifically on three-room flats. It then creates a new address column by combining the block number and street name, and extracts and formats the remaining lease period into years (remaining_lease_yr) and months (remaining_lease_mth). Finally, it displays the processed data for inspection.

4.2 Loading the Data

resale <- read_csv("data/aspatial/resale.csv") %>%
  filter(month >= "2023-01" & month <= "2024-09")
library(dplyr)
library(stringr)

resale_tidy <- resale %>%
  mutate(address = paste(block, street_name)) %>%
  mutate(
    remaining_lease_yr = as.integer(str_sub(remaining_lease, 0, 2)),
    remaining_lease_mth = as.integer(str_sub(remaining_lease, 9, 11))
  ) %>%
  filter(flat_type == "5 ROOM")
# View the result
print(resale_tidy)
# A tibble: 10,886 × 14
   month   town       flat_type block street_name    storey_range floor_area_sqm
   <chr>   <chr>      <chr>     <chr> <chr>          <chr>                 <dbl>
 1 2023-01 ANG MO KIO 5 ROOM    306   ANG MO KIO AV… 16 TO 18                123
 2 2023-01 ANG MO KIO 5 ROOM    306   ANG MO KIO AV… 04 TO 06                123
 3 2023-01 ANG MO KIO 5 ROOM    402   ANG MO KIO AV… 10 TO 12                119
 4 2023-01 ANG MO KIO 5 ROOM    259   ANG MO KIO AV… 13 TO 15                135
 5 2023-01 ANG MO KIO 5 ROOM    176   ANG MO KIO AV… 04 TO 06                119
 6 2023-01 ANG MO KIO 5 ROOM    618   ANG MO KIO AV… 13 TO 15                133
 7 2023-01 ANG MO KIO 5 ROOM    520   ANG MO KIO AV… 19 TO 21                118
 8 2023-01 ANG MO KIO 5 ROOM    700C  ANG MO KIO AV… 19 TO 21                111
 9 2023-01 ANG MO KIO 5 ROOM    714   ANG MO KIO AV… 10 TO 12                119
10 2023-01 ANG MO KIO 5 ROOM    253   ANG MO KIO ST… 04 TO 06                128
# ℹ 10,876 more rows
# ℹ 7 more variables: flat_model <chr>, lease_commence_date <dbl>,
#   remaining_lease <chr>, resale_price <dbl>, address <chr>,
#   remaining_lease_yr <int>, remaining_lease_mth <int>

4.3 Getting Coordinates

This code retrieves and processes geographical coordinates for a list of addresses. It begins by defining the get_coords function, which takes a list of addresses (add_list) and uses the OneMap API to obtain postal codes, latitudes, and longitudes for each address. The function creates a postal_coords data frame to store these results. For each address, if only one matching record is found, its coordinates are added. If multiple results are found, entries with “NIL” postal codes are ignored, and the first valid result is used. If no results are found, the coordinates are set to NA.

Once the coordinates are collected, they are saved as an .rds file. The code then reads this file, converts the data to a spatial (sf) object, specifying the latitude and longitude columns, and transforms it into the SVY21 (EPSG:3414) coordinate system, commonly used in Singapore. Finally, the structure of the new sf object is displayed.

resale_selected <- resale_tidy %>%
  filter(month == "2024-09")
add_list <- sort(unique(resale_selected$address))
get_coords <- function(add_list){
  
  # Create a data frame to store all retrieved coordinates
  postal_coords <- data.frame()
    
  for (i in add_list){
    #print(i)

    r <- GET('https://www.onemap.gov.sg/api/common/elastic/search?',
           query=list(searchVal=i,
                     returnGeom='Y',
                     getAddrDetails='Y'))
    data <- fromJSON(rawToChar(r$content))
    found <- data$found
    res <- data$results
    
    # Create a new data frame for each address
    new_row <- data.frame()
    
    # If single result, append 
    if (found == 1){
      postal <- res$POSTAL 
      lat <- res$LATITUDE
      lng <- res$LONGITUDE
      new_row <- data.frame(address= i, 
                            postal = postal, 
                            latitude = lat, 
                            longitude = lng)
    }
    
    # If multiple results, drop NIL and append top 1
    else if (found > 1){
      # Remove those with NIL as postal
      res_sub <- res[res$POSTAL != "NIL", ]
      
      # Set as NA first if no Postal
      if (nrow(res_sub) == 0) {
          new_row <- data.frame(address= i, 
                                postal = NA, 
                                latitude = NA, 
                                longitude = NA)
      }
      
      else{
        top1 <- head(res_sub, n = 1)
        postal <- top1$POSTAL 
        lat <- top1$LATITUDE
        lng <- top1$LONGITUDE
        new_row <- data.frame(address= i, 
                              postal = postal, 
                              latitude = lat, 
                              longitude = lng)
      }
    }

    else {
      new_row <- data.frame(address= i, 
                            postal = NA, 
                            latitude = NA, 
                            longitude = NA)
    }
    
    # Add the row
    postal_coords <- rbind(postal_coords, new_row)
  }
  return(postal_coords)
}
coords <- get_coords(add_list)
write_rds(coords, "data/geospatial/coords.rds")
# Load the spatial data (coords) from an .rds file
coords <- readRDS("data/geospatial/coords.rds")

This code performs spatial data processing and mapping with sf and tmap packages. Here’s a summary of each step:

  1. Joining Datasets: It starts by performing a left join between coords_sf (spatial coordinates of addresses) and resale_tidy (resale transaction data), using address as the common column. This merged data is expected to retain the sf geometry.

  2. Loading and Converting Geospatial Data: The mpsz dataset, containing Singapore’s subzone boundaries, is loaded and converted into an sf object. It’s reprojected to Singapore’s SVY21 coordinate system (EPSG:3414) for accurate spatial analysis.

  3. Mapping Subzones: A basic map of the mpsz dataset is created using tmap, displaying polygons representing Singapore’s subzones.

  4. Handling MULTIPOLYGONS: The code then:

    • Converts MULTIPOLYGON geometries in mpsz to individual POLYGON geometries.
    • Calculates the area of each polygon.
    • Selects the largest polygon within each subzone, identified by SUBZONE_N, retaining only the largest area to simplify further analysis.
  5. Filtering by Spatial Join: The joined_data dataset is filtered to include only points within the largest polygons of each subzone, using st_within() to spatially subset the points that fall inside the boundaries.

  6. Final Map Display: Finally, a map of the final_dataset is displayed with points (using tm_dots()), representing the filtered HDB resale transactions overlaid on the subzones map.

This approach ensures that only transactions within the primary subzone boundaries are included for analysis.

library(sf)

# Convert coords to an sf object, specifying latitude first, then longitude
coords_sf <- st_as_sf(coords, coords = c("longitude", "latitude"), crs = 4326)

# Convert to SVY21 (EPSG:3414) if needed
coords_sf <- st_transform(coords_sf, crs = 3414)

# Check the structure of the new sf object
print(coords_sf)
Simple feature collection with 487 features and 2 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 11755.72 ymin: 28457.97 xmax: 42441.24 ymax: 48339.17
Projected CRS: SVY21 / Singapore TM
First 10 features:
                 address postal                  geometry
1         101 SIMEI ST 1 520101  POINT (41144.95 35890.2)
2     102 TECK WHYE LANE 680102 POINT (19216.72 39896.49)
3     102A PUNGGOL FIELD 821102 POINT (36064.38 42301.51)
4     103 CLEMENTI ST 14 120103 POINT (20861.59 33870.73)
5     106 TAMPINES ST 11 521106 POINT (40726.66 36629.85)
6    106A BIDADARI PK DR 341106 POINT (32222.45 35147.86)
7     106A PUNGGOL FIELD 821106 POINT (36349.34 42055.12)
8          107 JLN RAJAH 320107  POINT (30412.22 34452.8)
9  107A EDGEFIELD PLAINS 821107 POINT (36180.37 42120.79)
10    108A CANBERRA WALK 751108 POINT (27772.55 47711.53)

4.4 Filtering Using Boundaries

# Perform the left join on the common column, 'address'
joined_data <- coords_sf %>%
  left_join(resale_tidy, by = "address")

# Check the class to confirm it's still an sf object with geometry retained
print(class(joined_data)) # Should return `sf` if geometry is intact
[1] "sf"         "data.frame"
mpsz <- read_rds("data/geospatial/mpsz.rds")
mpsz
Simple feature collection with 332 features and 6 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 2667.538 ymin: 15748.72 xmax: 56396.44 ymax: 50256.33
Projected CRS: SVY21 / Singapore TM
First 10 features:
                 SUBZONE_N SUBZONE_C       PLN_AREA_N PLN_AREA_C       REGION_N
1              MARINA EAST    MESZ01      MARINA EAST         ME CENTRAL REGION
2         INSTITUTION HILL    RVSZ05     RIVER VALLEY         RV CENTRAL REGION
3           ROBERTSON QUAY    SRSZ01  SINGAPORE RIVER         SR CENTRAL REGION
4  JURONG ISLAND AND BUKOM    WISZ01  WESTERN ISLANDS         WI    WEST REGION
5             FORT CANNING    MUSZ02           MUSEUM         MU CENTRAL REGION
6         MARINA EAST (MP)    MPSZ05    MARINE PARADE         MP CENTRAL REGION
7                   SUDONG    WISZ03  WESTERN ISLANDS         WI    WEST REGION
8                  SEMAKAU    WISZ02  WESTERN ISLANDS         WI    WEST REGION
9           SOUTHERN GROUP    SISZ02 SOUTHERN ISLANDS         SI CENTRAL REGION
10                 SENTOSA    SISZ01 SOUTHERN ISLANDS         SI CENTRAL REGION
   REGION_C                       geometry
1        CR MULTIPOLYGON (((33222.98 29...
2        CR MULTIPOLYGON (((28481.45 30...
3        CR MULTIPOLYGON (((28087.34 30...
4        WR MULTIPOLYGON (((14557.7 304...
5        CR MULTIPOLYGON (((29542.53 31...
6        CR MULTIPOLYGON (((35279.55 30...
7        WR MULTIPOLYGON (((15772.59 21...
8        WR MULTIPOLYGON (((19843.41 21...
9        CR MULTIPOLYGON (((30870.53 22...
10       CR MULTIPOLYGON (((26879.04 26...
mpsz_sf <- st_as_sf(mpsz, mpsz = c("longitude", "latitude"), crs = 4326) %>%
  st_transform(crs = 3414)
library(tmap)

# Set tmap to static mode
tmap_mode("plot")

# Create a basic map with `mpsz`
tm_shape(mpsz_sf) +
  tm_polygons()

library(sf)
library(dplyr)

# Step 1: Convert MULTIPOLYGON geometries to individual POLYGON geometries
mpsz_polygons <- st_cast(mpsz, "POLYGON")

# Step 2: Calculate the area for each POLYGON
mpsz_polygons <- mpsz_polygons %>%
  mutate(area = st_area(geometry))

# Step 3: Group by original MULTIPOLYGON identifier and keep the largest POLYGON by area
# Replace `unique_id` with the column that uniquely identifies each MULTIPOLYGON
mpsz_largest_polygons <- mpsz_polygons %>%
  group_by(SUBZONE_N) %>%
  filter(area == max(area)) %>%
  ungroup()

# Print the result to verify
print(mpsz_largest_polygons)
Simple feature collection with 332 features and 7 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: 2667.538 ymin: 19021.55 xmax: 56396.44 ymax: 50256.33
Projected CRS: SVY21 / Singapore TM
# A tibble: 332 × 8
   SUBZONE_N               SUBZONE_C PLN_AREA_N     PLN_AREA_C REGION_N REGION_C
   <chr>                   <chr>     <chr>          <chr>      <chr>    <chr>   
 1 MARINA EAST             MESZ01    MARINA EAST    ME         CENTRAL… CR      
 2 INSTITUTION HILL        RVSZ05    RIVER VALLEY   RV         CENTRAL… CR      
 3 ROBERTSON QUAY          SRSZ01    SINGAPORE RIV… SR         CENTRAL… CR      
 4 JURONG ISLAND AND BUKOM WISZ01    WESTERN ISLAN… WI         WEST RE… WR      
 5 FORT CANNING            MUSZ02    MUSEUM         MU         CENTRAL… CR      
 6 MARINA EAST (MP)        MPSZ05    MARINE PARADE  MP         CENTRAL… CR      
 7 SUDONG                  WISZ03    WESTERN ISLAN… WI         WEST RE… WR      
 8 SEMAKAU                 WISZ02    WESTERN ISLAN… WI         WEST RE… WR      
 9 SOUTHERN GROUP          SISZ02    SOUTHERN ISLA… SI         CENTRAL… CR      
10 SENTOSA                 SISZ01    SOUTHERN ISLA… SI         CENTRAL… CR      
# ℹ 322 more rows
# ℹ 2 more variables: geometry <POLYGON [m]>, area [m^2]
library(tmap)

# Set tmap to static mode
tmap_mode("plot")

# Create a basic map with `mpsz`
tm_shape(mpsz_largest_polygons) +
  tm_polygons()

final_dataset <- joined_data[st_within(joined_data, mpsz_largest_polygons, sparse = FALSE), ]
final_dataset <- st_as_sf(mpsz, mpsz = c("longitude", "latitude"), crs = 4326) %>%
  st_transform(crs = 3414)
library(tmap)

# Set tmap to static mode
tmap_mode("plot")

# Create a basic map with `mpsz`
tm_shape(final_dataset) +
  tm_dots()

This code refines spatial data processing and visualization by ensuring consistent coordinate reference systems (CRS) for accurate spatial joins and mapping.

  1. CRS Alignment: First, both joined_data (the points data) and mpsz_largest_polygons (the largest subzone polygons) are transformed to the SVY21 coordinate system (EPSG:3414) for accurate spatial alignment.

  2. Spatial Join: A spatial join is performed using st_join with the st_within join type. This operation links each point in joined_data to its containing polygon in mpsz_largest_polygons, effectively adding polygon attributes to each point for further analysis.

  3. Verification: The resulting final_dataset is printed to verify that the join has successfully added polygon attributes to each point.

  4. Mapping in Static Mode: The tmap library is used to create a static map showing both subzone polygons and points. The polygons are displayed in light blue with a dark blue border, while the points are plotted in red, giving a clear view of which points fall within each subzone.

  5. Dynamic Map in WGS84 and SVY21: The code then reprojects joined_data between WGS84 (EPSG:4326) and SVY21 (EPSG:3414) to provide flexibility in viewing modes. Using tmap_mode("view"), it displays a dynamic, interactive map of joined_data, allowing for exploration of the spatial data.

This approach confirms the spatial relationships between points and polygons, facilitating further geospatial analysis and visualization.

library(sf)
library(dplyr)

# Step 1: Make sure both datasets are in the same CRS
# Assuming you want to use EPSG:3414 (SVY21)
joined_data <- st_transform(joined_data, crs = 3414)
mpsz_largest_polygons <- st_transform(mpsz_largest_polygons, crs = 3414)

# Step 2: Perform a spatial join to combine points with their respective boundaries
# This will add the polygon attributes to each point in joined_data
final_dataset <- st_join(joined_data, mpsz_largest_polygons, join = st_within)

# Step 3: Verify the results
print(final_dataset)
Simple feature collection with 2052 features and 22 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 11755.72 ymin: 28457.97 xmax: 42441.24 ymax: 48339.17
Projected CRS: SVY21 / Singapore TM
First 10 features:
              address postal   month          town flat_type block
1      101 SIMEI ST 1 520101 2023-07      TAMPINES    5 ROOM   101
2      101 SIMEI ST 1 520101 2024-09      TAMPINES    5 ROOM   101
3  102 TECK WHYE LANE 680102 2023-06 CHOA CHU KANG    5 ROOM   102
4  102 TECK WHYE LANE 680102 2024-09 CHOA CHU KANG    5 ROOM   102
5  102A PUNGGOL FIELD 821102 2024-08       PUNGGOL    5 ROOM  102A
6  102A PUNGGOL FIELD 821102 2024-09       PUNGGOL    5 ROOM  102A
7  103 CLEMENTI ST 14 120103 2023-04      CLEMENTI    5 ROOM   103
8  103 CLEMENTI ST 14 120103 2023-04      CLEMENTI    5 ROOM   103
9  103 CLEMENTI ST 14 120103 2023-07      CLEMENTI    5 ROOM   103
10 103 CLEMENTI ST 14 120103 2024-07      CLEMENTI    5 ROOM   103
      street_name storey_range floor_area_sqm flat_model lease_commence_date
1      SIMEI ST 1     01 TO 03            128   Improved                1988
2      SIMEI ST 1     01 TO 03            122   Improved                1988
3  TECK WHYE LANE     01 TO 03            121   Improved                1989
4  TECK WHYE LANE     01 TO 03            121   Improved                1989
5   PUNGGOL FIELD     01 TO 03            110   Improved                2002
6   PUNGGOL FIELD     16 TO 18            110   Improved                2002
7  CLEMENTI ST 14     10 TO 12            121   Improved                1984
8  CLEMENTI ST 14     07 TO 09            123   Improved                1984
9  CLEMENTI ST 14     04 TO 06            121   Improved                1984
10 CLEMENTI ST 14     13 TO 15            123   Improved                1984
      remaining_lease resale_price remaining_lease_yr remaining_lease_mth
1  64 years 03 months       680000                 64                   3
2   63 years 01 month       715000                 63                   1
3  64 years 11 months       520000                 64                  11
4  63 years 08 months       575000                 63                   8
5  77 years 04 months       603888                 77                   4
6  77 years 03 months       655000                 77                   3
7  60 years 08 months       808800                 60                   8
8  60 years 08 months       800000                 60                   8
9  60 years 05 months       758000                 60                   5
10 59 years 06 months       500000                 59                   6
       SUBZONE_N SUBZONE_C    PLN_AREA_N PLN_AREA_C          REGION_N REGION_C
1          SIMEI    TMSZ04      TAMPINES         TM       EAST REGION       ER
2          SIMEI    TMSZ04      TAMPINES         TM       EAST REGION       ER
3      TECK WHYE    CKSZ01 CHOA CHU KANG         CK       WEST REGION       WR
4      TECK WHYE    CKSZ01 CHOA CHU KANG         CK       WEST REGION       WR
5  PUNGGOL FIELD    PGSZ04       PUNGGOL         PG NORTH-EAST REGION      NER
6  PUNGGOL FIELD    PGSZ04       PUNGGOL         PG NORTH-EAST REGION      NER
7     SUNSET WAY    CLSZ02      CLEMENTI         CL       WEST REGION       WR
8     SUNSET WAY    CLSZ02      CLEMENTI         CL       WEST REGION       WR
9     SUNSET WAY    CLSZ02      CLEMENTI         CL       WEST REGION       WR
10    SUNSET WAY    CLSZ02      CLEMENTI         CL       WEST REGION       WR
              area                  geometry
1  2777496.6 [m^2]  POINT (41144.95 35890.2)
2  2777496.6 [m^2]  POINT (41144.95 35890.2)
3  1003202.6 [m^2] POINT (19216.72 39896.49)
4  1003202.6 [m^2] POINT (19216.72 39896.49)
5  1378710.5 [m^2] POINT (36064.38 42301.51)
6  1378710.5 [m^2] POINT (36064.38 42301.51)
7   963924.6 [m^2] POINT (20861.59 33870.73)
8   963924.6 [m^2] POINT (20861.59 33870.73)
9   963924.6 [m^2] POINT (20861.59 33870.73)
10  963924.6 [m^2] POINT (20861.59 33870.73)
library(tmap)

# Set tmap to plot mode for a static map
tmap_mode("plot")

# Plot both polygons and points
tm_shape(mpsz_largest_polygons) +
  tm_polygons(alpha = 0.3, col = "lightblue", border.col = "darkblue") +
tm_shape(final_dataset) +
  tm_dots(col = "red", size = 0.2) +
  tm_layout(main.title = "Points within Boundaries", legend.outside = TRUE)

# Transform joined_data to WGS84 coordinate system (EPSG:4326)
joined_data <- st_transform(joined_data, crs = st_crs(4326))
# Transform joined_data to SVY21 coordinate system (EPSG:3414)
joined_data <- st_transform(joined_data, crs = st_crs(3414))
tmap_mode("view")
tm_shape(joined_data)+
  tm_dots()

ADDING DERIVED VARIABLES

You loaded eldercare facility data from Data.gov.sg, transformed it to match the coordinate system of your HDB resale dataset, and then calculated the distance from each HDB resale location to the nearest eldercare facility. This distance was added as a new variable in the dataset to analyze accessibility to eldercare services for each resale location.

eldercare <- st_read(dsn = "data/geospatial/EldercareServicesSHP", layer = "ELDERCARE")
Reading layer `ELDERCARE' from data source 
  `C:\loriellemalveda\ISSS626-GAA\Take-home_Ex\Take-home_Ex03\data\geospatial\EldercareServicesSHP' 
  using driver `ESRI Shapefile'
Simple feature collection with 133 features and 18 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 14481.92 ymin: 28218.43 xmax: 41665.14 ymax: 46804.9
Projected CRS: SVY21
eldercare <- st_transform(eldercare, crs = 3414)
library(sf)
library(dplyr)

# Ensure both datasets are in the same CRS
joined_data <- st_transform(joined_data, crs = 3414)
eldercare <- st_transform(eldercare, crs = 3414)

# Step 1: Find the nearest eldercare point for each HDB point
nearest_eldercare_index <- st_nearest_feature(joined_data, eldercare)

# Step 2: Calculate the distance to the nearest eldercare point
joined_data <- joined_data %>%
  mutate(distance_to_eldercare = as.numeric(st_distance(geometry, eldercare[nearest_eldercare_index, ], by_element = TRUE)))

# Check the updated dataset
print(joined_data)
Simple feature collection with 2052 features and 16 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 11755.72 ymin: 28457.97 xmax: 42441.24 ymax: 48339.17
Projected CRS: SVY21 / Singapore TM
First 10 features:
              address postal   month          town flat_type block
1      101 SIMEI ST 1 520101 2023-07      TAMPINES    5 ROOM   101
2      101 SIMEI ST 1 520101 2024-09      TAMPINES    5 ROOM   101
3  102 TECK WHYE LANE 680102 2023-06 CHOA CHU KANG    5 ROOM   102
4  102 TECK WHYE LANE 680102 2024-09 CHOA CHU KANG    5 ROOM   102
5  102A PUNGGOL FIELD 821102 2024-08       PUNGGOL    5 ROOM  102A
6  102A PUNGGOL FIELD 821102 2024-09       PUNGGOL    5 ROOM  102A
7  103 CLEMENTI ST 14 120103 2023-04      CLEMENTI    5 ROOM   103
8  103 CLEMENTI ST 14 120103 2023-04      CLEMENTI    5 ROOM   103
9  103 CLEMENTI ST 14 120103 2023-07      CLEMENTI    5 ROOM   103
10 103 CLEMENTI ST 14 120103 2024-07      CLEMENTI    5 ROOM   103
      street_name storey_range floor_area_sqm flat_model lease_commence_date
1      SIMEI ST 1     01 TO 03            128   Improved                1988
2      SIMEI ST 1     01 TO 03            122   Improved                1988
3  TECK WHYE LANE     01 TO 03            121   Improved                1989
4  TECK WHYE LANE     01 TO 03            121   Improved                1989
5   PUNGGOL FIELD     01 TO 03            110   Improved                2002
6   PUNGGOL FIELD     16 TO 18            110   Improved                2002
7  CLEMENTI ST 14     10 TO 12            121   Improved                1984
8  CLEMENTI ST 14     07 TO 09            123   Improved                1984
9  CLEMENTI ST 14     04 TO 06            121   Improved                1984
10 CLEMENTI ST 14     13 TO 15            123   Improved                1984
      remaining_lease resale_price remaining_lease_yr remaining_lease_mth
1  64 years 03 months       680000                 64                   3
2   63 years 01 month       715000                 63                   1
3  64 years 11 months       520000                 64                  11
4  63 years 08 months       575000                 63                   8
5  77 years 04 months       603888                 77                   4
6  77 years 03 months       655000                 77                   3
7  60 years 08 months       808800                 60                   8
8  60 years 08 months       800000                 60                   8
9  60 years 05 months       758000                 60                   5
10 59 years 06 months       500000                 59                   6
                    geometry distance_to_eldercare
1   POINT (41144.95 35890.2)              758.1380
2   POINT (41144.95 35890.2)              758.1380
3  POINT (19216.72 39896.49)              327.6085
4  POINT (19216.72 39896.49)              327.6085
5  POINT (36064.38 42301.51)              701.7365
6  POINT (36064.38 42301.51)              701.7365
7  POINT (20861.59 33870.73)              378.3587
8  POINT (20861.59 33870.73)              378.3587
9  POINT (20861.59 33870.73)              378.3587
10 POINT (20861.59 33870.73)              378.3587

This code calculates the distance from each HDB resale location to Raffles Place, which represents Singapore’s Central Business District (CBD). First, the coordinates for Raffles Place (103.8514, 1.2839 in WGS84) are defined and then transformed to match the CRS of the HDB dataset (EPSG:3414). Using st_distance, the distance from each HDB point to the CBD is computed and added to joined_data as a new column, distance_to_cbd, to assess proximity to the CBD for each resale property.

library(sf)
library(dplyr)

# Ensure joined_data is in the projected CRS (e.g., EPSG:3414)
joined_data <- st_transform(joined_data, crs = 3414)

# Define the CBD point (Raffles Place coordinates in WGS84)
cbd_point <- st_sfc(st_point(c(103.8514, 1.2839)), crs = 4326)  # WGS84

# Transform CBD point to the same CRS as joined_data (EPSG:3414)
cbd_point <- st_transform(cbd_point, crs = 3414)

# Calculate the distance from each HDB point to the CBD point
joined_data <- joined_data %>%
  mutate(distance_to_cbd = as.numeric(st_distance(geometry, cbd_point)[, 1]))

# View the updated dataset with distance to CBD
print(joined_data)
Simple feature collection with 2052 features and 17 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 11755.72 ymin: 28457.97 xmax: 42441.24 ymax: 48339.17
Projected CRS: SVY21 / Singapore TM
First 10 features:
              address postal   month          town flat_type block
1      101 SIMEI ST 1 520101 2023-07      TAMPINES    5 ROOM   101
2      101 SIMEI ST 1 520101 2024-09      TAMPINES    5 ROOM   101
3  102 TECK WHYE LANE 680102 2023-06 CHOA CHU KANG    5 ROOM   102
4  102 TECK WHYE LANE 680102 2024-09 CHOA CHU KANG    5 ROOM   102
5  102A PUNGGOL FIELD 821102 2024-08       PUNGGOL    5 ROOM  102A
6  102A PUNGGOL FIELD 821102 2024-09       PUNGGOL    5 ROOM  102A
7  103 CLEMENTI ST 14 120103 2023-04      CLEMENTI    5 ROOM   103
8  103 CLEMENTI ST 14 120103 2023-04      CLEMENTI    5 ROOM   103
9  103 CLEMENTI ST 14 120103 2023-07      CLEMENTI    5 ROOM   103
10 103 CLEMENTI ST 14 120103 2024-07      CLEMENTI    5 ROOM   103
      street_name storey_range floor_area_sqm flat_model lease_commence_date
1      SIMEI ST 1     01 TO 03            128   Improved                1988
2      SIMEI ST 1     01 TO 03            122   Improved                1988
3  TECK WHYE LANE     01 TO 03            121   Improved                1989
4  TECK WHYE LANE     01 TO 03            121   Improved                1989
5   PUNGGOL FIELD     01 TO 03            110   Improved                2002
6   PUNGGOL FIELD     16 TO 18            110   Improved                2002
7  CLEMENTI ST 14     10 TO 12            121   Improved                1984
8  CLEMENTI ST 14     07 TO 09            123   Improved                1984
9  CLEMENTI ST 14     04 TO 06            121   Improved                1984
10 CLEMENTI ST 14     13 TO 15            123   Improved                1984
      remaining_lease resale_price remaining_lease_yr remaining_lease_mth
1  64 years 03 months       680000                 64                   3
2   63 years 01 month       715000                 63                   1
3  64 years 11 months       520000                 64                  11
4  63 years 08 months       575000                 63                   8
5  77 years 04 months       603888                 77                   4
6  77 years 03 months       655000                 77                   3
7  60 years 08 months       808800                 60                   8
8  60 years 08 months       800000                 60                   8
9  60 years 05 months       758000                 60                   5
10 59 years 06 months       500000                 59                   6
                    geometry distance_to_eldercare distance_to_cbd
1   POINT (41144.95 35890.2)              758.1380        12790.41
2   POINT (41144.95 35890.2)              758.1380        12790.41
3  POINT (19216.72 39896.49)              327.6085        14923.59
4  POINT (19216.72 39896.49)              327.6085        14923.59
5  POINT (36064.38 42301.51)              701.7365        14076.30
6  POINT (36064.38 42301.51)              701.7365        14076.30
7  POINT (20861.59 33870.73)              378.3587        10101.36
8  POINT (20861.59 33870.73)              378.3587        10101.36
9  POINT (20861.59 33870.73)              378.3587        10101.36
10 POINT (20861.59 33870.73)              378.3587        10101.36

This code calculates the distance from each HDB resale location to the nearest hawker center. It first ensures that both joined_data (HDB resale data) and hawker (hawker center locations) are in the same coordinate reference system (EPSG:3414). Then, using st_nearest_feature, it identifies the index of the closest hawker center for each HDB location. The st_distance function calculates the distance from each HDB point to its nearest hawker center, adding this as a new column, distance_to_hawker, in the joined_data dataset. This additional variable enables analysis of proximity to hawker centers for each resale property.

hawker <- st_read("data/geospatial/HawkerCentresKML.kml")
Reading layer `HAWKERCENTRE' from data source 
  `C:\loriellemalveda\ISSS626-GAA\Take-home_Ex\Take-home_Ex03\data\geospatial\HawkerCentresKML.kml' 
  using driver `KML'
Simple feature collection with 125 features and 2 fields
Geometry type: POINT
Dimension:     XYZ
Bounding box:  xmin: 103.6974 ymin: 1.272716 xmax: 103.9882 ymax: 1.449017
z_range:       zmin: 0 zmax: 0
Geodetic CRS:  WGS 84
library(sf)
library(dplyr)

# Ensure both datasets are in the same CRS
joined_data <- st_transform(joined_data, crs = 3414)
hawker <- st_transform(hawker, crs = 3414)  # Assuming hawker dataset is in sf format

# Step 1: Find the index of the nearest hawker center for each HDB point
nearest_hawker_index <- st_nearest_feature(joined_data, hawker)

# Step 2: Calculate the distance to the nearest hawker center
joined_data <- joined_data %>%
  mutate(distance_to_hawker = as.numeric(st_distance(geometry, hawker[nearest_hawker_index, ], by_element = TRUE)))

# View the updated dataset with the new distance column
print(joined_data)
Simple feature collection with 2052 features and 18 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 11755.72 ymin: 28457.97 xmax: 42441.24 ymax: 48339.17
Projected CRS: SVY21 / Singapore TM
First 10 features:
              address postal   month          town flat_type block
1      101 SIMEI ST 1 520101 2023-07      TAMPINES    5 ROOM   101
2      101 SIMEI ST 1 520101 2024-09      TAMPINES    5 ROOM   101
3  102 TECK WHYE LANE 680102 2023-06 CHOA CHU KANG    5 ROOM   102
4  102 TECK WHYE LANE 680102 2024-09 CHOA CHU KANG    5 ROOM   102
5  102A PUNGGOL FIELD 821102 2024-08       PUNGGOL    5 ROOM  102A
6  102A PUNGGOL FIELD 821102 2024-09       PUNGGOL    5 ROOM  102A
7  103 CLEMENTI ST 14 120103 2023-04      CLEMENTI    5 ROOM   103
8  103 CLEMENTI ST 14 120103 2023-04      CLEMENTI    5 ROOM   103
9  103 CLEMENTI ST 14 120103 2023-07      CLEMENTI    5 ROOM   103
10 103 CLEMENTI ST 14 120103 2024-07      CLEMENTI    5 ROOM   103
      street_name storey_range floor_area_sqm flat_model lease_commence_date
1      SIMEI ST 1     01 TO 03            128   Improved                1988
2      SIMEI ST 1     01 TO 03            122   Improved                1988
3  TECK WHYE LANE     01 TO 03            121   Improved                1989
4  TECK WHYE LANE     01 TO 03            121   Improved                1989
5   PUNGGOL FIELD     01 TO 03            110   Improved                2002
6   PUNGGOL FIELD     16 TO 18            110   Improved                2002
7  CLEMENTI ST 14     10 TO 12            121   Improved                1984
8  CLEMENTI ST 14     07 TO 09            123   Improved                1984
9  CLEMENTI ST 14     04 TO 06            121   Improved                1984
10 CLEMENTI ST 14     13 TO 15            123   Improved                1984
      remaining_lease resale_price remaining_lease_yr remaining_lease_mth
1  64 years 03 months       680000                 64                   3
2   63 years 01 month       715000                 63                   1
3  64 years 11 months       520000                 64                  11
4  63 years 08 months       575000                 63                   8
5  77 years 04 months       603888                 77                   4
6  77 years 03 months       655000                 77                   3
7  60 years 08 months       808800                 60                   8
8  60 years 08 months       800000                 60                   8
9  60 years 05 months       758000                 60                   5
10 59 years 06 months       500000                 59                   6
                    geometry distance_to_eldercare distance_to_cbd
1   POINT (41144.95 35890.2)              758.1380        12790.41
2   POINT (41144.95 35890.2)              758.1380        12790.41
3  POINT (19216.72 39896.49)              327.6085        14923.59
4  POINT (19216.72 39896.49)              327.6085        14923.59
5  POINT (36064.38 42301.51)              701.7365        14076.30
6  POINT (36064.38 42301.51)              701.7365        14076.30
7  POINT (20861.59 33870.73)              378.3587        10101.36
8  POINT (20861.59 33870.73)              378.3587        10101.36
9  POINT (20861.59 33870.73)              378.3587        10101.36
10 POINT (20861.59 33870.73)              378.3587        10101.36
   distance_to_hawker
1            924.5228
2            924.5228
3           1333.7754
4           1333.7754
5           1040.3345
6           1040.3345
7            927.6112
8            927.6112
9            927.6112
10           927.6112

This code loads shopping mall location data from a CSV file (sourced from Kaggle), converts it into a spatial sf object using the latitude and longitude columns, and sets the coordinate reference system to WGS84 (EPSG:4326). Following the same approach as with the eldercare and hawker center data, it ensures both the HDB dataset and shopping mall locations are in the same coordinate system, enabling spatial analysis. Using spatial functions, it then calculates the distance from each HDB resale location to the nearest shopping mall, adding this as a new column to the HDB dataset for proximity analysis.

library(sf)
library(dplyr)

# Step 1: Read the CSV file (replace 'malls.csv' with the path to your file)
malls_data <- read.csv("data/aspatial/shopping_mall_coordinates.csv")

# Step 2: Convert to sf object using latitude and longitude columns
# Assuming the latitude and longitude columns are named "lat" and "long"
malls_sf <- malls_data %>%
  st_as_sf(coords = c("LONGITUDE", "LATITUDE"), crs = 4326)  # WGS84 CRS

# View the resulting sf object to confirm it has geometry
print(malls_sf)
Simple feature collection with 155 features and 1 field
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 103.6972 ymin: 1.274588 xmax: 103.99 ymax: 1.448231
Geodetic CRS:  WGS 84
First 10 features:
                        Mall.Name                  geometry
1                          100 AM POINT (103.8435 1.274588)
2                    313@Somerset POINT (103.8384 1.301014)
3                          Aperia POINT (103.8643 1.310474)
4  Balestier Hill Shopping Centre POINT (103.8426 1.325596)
5                      Bugis Cube POINT (103.8556 1.298141)
6                  Bugis Junction POINT (103.8554 1.299113)
7                          Bugis+ POINT (103.8552 1.300952)
8                  Capitol Piazza POINT (103.8513 1.293079)
9      Cathay Cineleisure Orchard POINT (103.8365 1.301464)
10                The Centrepoint    POINT (103.84 1.30145)
malls_sf <- st_transform(malls_sf, crs = 3414)
library(sf)
library(dplyr)

# Ensure both datasets are in the same CRS (e.g., EPSG:3414 for SVY21 in Singapore)
joined_data <- st_transform(joined_data, crs = 3414)
malls_sf <- st_transform(malls_sf, crs = 3414)

# Step 1: Find the index of the nearest mall for each HDB point
nearest_mall_index <- st_nearest_feature(joined_data, malls_sf)

# Step 2: Calculate the distance to the nearest mall
joined_data <- joined_data %>%
  mutate(distance_to_mall = as.numeric(st_distance(geometry, malls_sf[nearest_mall_index, ], by_element = TRUE)))

# View the updated dataset with the new distance column
print(joined_data)
Simple feature collection with 2052 features and 19 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 11755.72 ymin: 28457.97 xmax: 42441.24 ymax: 48339.17
Projected CRS: SVY21 / Singapore TM
First 10 features:
              address postal   month          town flat_type block
1      101 SIMEI ST 1 520101 2023-07      TAMPINES    5 ROOM   101
2      101 SIMEI ST 1 520101 2024-09      TAMPINES    5 ROOM   101
3  102 TECK WHYE LANE 680102 2023-06 CHOA CHU KANG    5 ROOM   102
4  102 TECK WHYE LANE 680102 2024-09 CHOA CHU KANG    5 ROOM   102
5  102A PUNGGOL FIELD 821102 2024-08       PUNGGOL    5 ROOM  102A
6  102A PUNGGOL FIELD 821102 2024-09       PUNGGOL    5 ROOM  102A
7  103 CLEMENTI ST 14 120103 2023-04      CLEMENTI    5 ROOM   103
8  103 CLEMENTI ST 14 120103 2023-04      CLEMENTI    5 ROOM   103
9  103 CLEMENTI ST 14 120103 2023-07      CLEMENTI    5 ROOM   103
10 103 CLEMENTI ST 14 120103 2024-07      CLEMENTI    5 ROOM   103
      street_name storey_range floor_area_sqm flat_model lease_commence_date
1      SIMEI ST 1     01 TO 03            128   Improved                1988
2      SIMEI ST 1     01 TO 03            122   Improved                1988
3  TECK WHYE LANE     01 TO 03            121   Improved                1989
4  TECK WHYE LANE     01 TO 03            121   Improved                1989
5   PUNGGOL FIELD     01 TO 03            110   Improved                2002
6   PUNGGOL FIELD     16 TO 18            110   Improved                2002
7  CLEMENTI ST 14     10 TO 12            121   Improved                1984
8  CLEMENTI ST 14     07 TO 09            123   Improved                1984
9  CLEMENTI ST 14     04 TO 06            121   Improved                1984
10 CLEMENTI ST 14     13 TO 15            123   Improved                1984
      remaining_lease resale_price remaining_lease_yr remaining_lease_mth
1  64 years 03 months       680000                 64                   3
2   63 years 01 month       715000                 63                   1
3  64 years 11 months       520000                 64                  11
4  63 years 08 months       575000                 63                   8
5  77 years 04 months       603888                 77                   4
6  77 years 03 months       655000                 77                   3
7  60 years 08 months       808800                 60                   8
8  60 years 08 months       800000                 60                   8
9  60 years 05 months       758000                 60                   5
10 59 years 06 months       500000                 59                   6
                    geometry distance_to_eldercare distance_to_cbd
1   POINT (41144.95 35890.2)              758.1380        12790.41
2   POINT (41144.95 35890.2)              758.1380        12790.41
3  POINT (19216.72 39896.49)              327.6085        14923.59
4  POINT (19216.72 39896.49)              327.6085        14923.59
5  POINT (36064.38 42301.51)              701.7365        14076.30
6  POINT (36064.38 42301.51)              701.7365        14076.30
7  POINT (20861.59 33870.73)              378.3587        10101.36
8  POINT (20861.59 33870.73)              378.3587        10101.36
9  POINT (20861.59 33870.73)              378.3587        10101.36
10 POINT (20861.59 33870.73)              378.3587        10101.36
   distance_to_hawker distance_to_mall
1            924.5228         271.0456
2            924.5228         271.0456
3           1333.7754         390.7374
4           1333.7754         390.7374
5           1040.3345         748.0958
6           1040.3345         748.0958
7            927.6112        1001.6526
8            927.6112        1001.6526
9            927.6112        1001.6526
10           927.6112        1001.6526

This code processes train station data from a KML file, converting it to a spatial format and setting the coordinate reference system to EPSG:3414 (SVY21) for Singapore. It begins by checking for and identifying any invalid geometries in train_stations, printing them for review. After confirming both train_stations and joined_data (HDB resale data) are in the same CRS, the code calculates the distance from each HDB resale location to the nearest train station using spatial functions. This distance is added as a new column, distance_to_mrt, in the joined_data dataset, providing a measure of proximity to MRT stations for further analysis.

train_stations <- st_read(dsn = "data/geospatial/LTAMRTStationExitKML.kml")
Reading layer `MRT_EXITS' from data source 
  `C:\loriellemalveda\ISSS626-GAA\Take-home_Ex\Take-home_Ex03\data\geospatial\LTAMRTStationExitKML.kml' 
  using driver `KML'
Simple feature collection with 563 features and 2 fields
Geometry type: POINT
Dimension:     XYZ
Bounding box:  xmin: 103.6368 ymin: 1.264972 xmax: 103.9893 ymax: 1.449157
z_range:       zmin: 0 zmax: 0
Geodetic CRS:  WGS 84
train_stations <- st_transform(train_stations, crs = 3414)
library(sf)
library(dplyr)

# Ensure train_stations is in the desired CRS (e.g., EPSG:3414)
train_stations <- st_transform(train_stations, crs = 3414)

# Step 1: Identify invalid polygons
invalid_polygons <- train_stations %>%
  filter(!st_is_valid(geometry))

# Step 2: View invalid polygons
print(invalid_polygons)
Simple feature collection with 0 features and 2 fields
Bounding box:  xmin: NA ymin: NA xmax: NA ymax: NA
Projected CRS: SVY21 / Singapore TM
[1] Name        Description geometry   
<0 rows> (or 0-length row.names)
# Ensure both datasets are in the same CRS (e.g., EPSG:3414 for SVY21 in Singapore)
joined_data <- st_transform(joined_data, crs = 3414)
train_sf <- st_transform(train_stations, crs = 3414)

# Step 1: Find the index of the nearest mall for each HDB point
nearest_mrt_index <- st_nearest_feature(joined_data, train_sf)

# Step 2: Calculate the distance to the nearest mall
joined_data <- joined_data %>%
  mutate(distance_to_mrt = as.numeric(st_distance(geometry, train_sf[nearest_mrt_index, ], by_element = TRUE)))

# View the updated dataset with the new distance column
print(joined_data)
Simple feature collection with 2052 features and 20 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 11755.72 ymin: 28457.97 xmax: 42441.24 ymax: 48339.17
Projected CRS: SVY21 / Singapore TM
First 10 features:
              address postal   month          town flat_type block
1      101 SIMEI ST 1 520101 2023-07      TAMPINES    5 ROOM   101
2      101 SIMEI ST 1 520101 2024-09      TAMPINES    5 ROOM   101
3  102 TECK WHYE LANE 680102 2023-06 CHOA CHU KANG    5 ROOM   102
4  102 TECK WHYE LANE 680102 2024-09 CHOA CHU KANG    5 ROOM   102
5  102A PUNGGOL FIELD 821102 2024-08       PUNGGOL    5 ROOM  102A
6  102A PUNGGOL FIELD 821102 2024-09       PUNGGOL    5 ROOM  102A
7  103 CLEMENTI ST 14 120103 2023-04      CLEMENTI    5 ROOM   103
8  103 CLEMENTI ST 14 120103 2023-04      CLEMENTI    5 ROOM   103
9  103 CLEMENTI ST 14 120103 2023-07      CLEMENTI    5 ROOM   103
10 103 CLEMENTI ST 14 120103 2024-07      CLEMENTI    5 ROOM   103
      street_name storey_range floor_area_sqm flat_model lease_commence_date
1      SIMEI ST 1     01 TO 03            128   Improved                1988
2      SIMEI ST 1     01 TO 03            122   Improved                1988
3  TECK WHYE LANE     01 TO 03            121   Improved                1989
4  TECK WHYE LANE     01 TO 03            121   Improved                1989
5   PUNGGOL FIELD     01 TO 03            110   Improved                2002
6   PUNGGOL FIELD     16 TO 18            110   Improved                2002
7  CLEMENTI ST 14     10 TO 12            121   Improved                1984
8  CLEMENTI ST 14     07 TO 09            123   Improved                1984
9  CLEMENTI ST 14     04 TO 06            121   Improved                1984
10 CLEMENTI ST 14     13 TO 15            123   Improved                1984
      remaining_lease resale_price remaining_lease_yr remaining_lease_mth
1  64 years 03 months       680000                 64                   3
2   63 years 01 month       715000                 63                   1
3  64 years 11 months       520000                 64                  11
4  63 years 08 months       575000                 63                   8
5  77 years 04 months       603888                 77                   4
6  77 years 03 months       655000                 77                   3
7  60 years 08 months       808800                 60                   8
8  60 years 08 months       800000                 60                   8
9  60 years 05 months       758000                 60                   5
10 59 years 06 months       500000                 59                   6
                    geometry distance_to_eldercare distance_to_cbd
1   POINT (41144.95 35890.2)              758.1380        12790.41
2   POINT (41144.95 35890.2)              758.1380        12790.41
3  POINT (19216.72 39896.49)              327.6085        14923.59
4  POINT (19216.72 39896.49)              327.6085        14923.59
5  POINT (36064.38 42301.51)              701.7365        14076.30
6  POINT (36064.38 42301.51)              701.7365        14076.30
7  POINT (20861.59 33870.73)              378.3587        10101.36
8  POINT (20861.59 33870.73)              378.3587        10101.36
9  POINT (20861.59 33870.73)              378.3587        10101.36
10 POINT (20861.59 33870.73)              378.3587        10101.36
   distance_to_hawker distance_to_mall distance_to_mrt
1            924.5228         271.0456       332.54769
2            924.5228         271.0456       332.54769
3           1333.7754         390.7374        85.37721
4           1333.7754         390.7374        85.37721
5           1040.3345         748.0958        36.02671
6           1040.3345         748.0958        36.02671
7            927.6112        1001.6526       903.86491
8            927.6112        1001.6526       903.86491
9            927.6112        1001.6526       903.86491
10           927.6112        1001.6526       903.86491

This code loads park data from a KML file, converts it to an sf spatial format, and sets the coordinate reference system to EPSG:3414 (SVY21) for consistency with the HDB resale data. After ensuring that both joined_data (HDB data) and parks are in the same CRS, it calculates the distance from each HDB resale location to the nearest park. This distance is added as a new column, distance_to_parks, in the joined_data dataset, allowing for analysis of proximity to parks for each resale location.

parks <- st_read(dsn = "data/geospatial/Parks.kml")
Reading layer `NATIONALPARKS' from data source 
  `C:\loriellemalveda\ISSS626-GAA\Take-home_Ex\Take-home_Ex03\data\geospatial\Parks.kml' 
  using driver `KML'
Simple feature collection with 430 features and 2 fields
Geometry type: POINT
Dimension:     XYZ
Bounding box:  xmin: 103.6929 ymin: 1.214491 xmax: 104.0538 ymax: 1.462094
z_range:       zmin: 0 zmax: 0
Geodetic CRS:  WGS 84
parks<- st_transform(parks, crs = 3414)
# Ensure both datasets are in the same CRS (e.g., EPSG:3414 for SVY21 in Singapore)
joined_data <- st_transform(joined_data, crs = 3414)
parks_sf <- st_transform(parks, crs = 3414)

# Step 1: Find the index of the nearest mall for each HDB point
nearest_parks_index <- st_nearest_feature(joined_data, parks_sf)

# Step 2: Calculate the distance to the nearest mall
joined_data <- joined_data %>%
  mutate(distance_to_parks = as.numeric(st_distance(geometry, parks_sf[nearest_parks_index, ], by_element = TRUE)))

# View the updated dataset with the new distance column
print(joined_data)
Simple feature collection with 2052 features and 21 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 11755.72 ymin: 28457.97 xmax: 42441.24 ymax: 48339.17
Projected CRS: SVY21 / Singapore TM
First 10 features:
              address postal   month          town flat_type block
1      101 SIMEI ST 1 520101 2023-07      TAMPINES    5 ROOM   101
2      101 SIMEI ST 1 520101 2024-09      TAMPINES    5 ROOM   101
3  102 TECK WHYE LANE 680102 2023-06 CHOA CHU KANG    5 ROOM   102
4  102 TECK WHYE LANE 680102 2024-09 CHOA CHU KANG    5 ROOM   102
5  102A PUNGGOL FIELD 821102 2024-08       PUNGGOL    5 ROOM  102A
6  102A PUNGGOL FIELD 821102 2024-09       PUNGGOL    5 ROOM  102A
7  103 CLEMENTI ST 14 120103 2023-04      CLEMENTI    5 ROOM   103
8  103 CLEMENTI ST 14 120103 2023-04      CLEMENTI    5 ROOM   103
9  103 CLEMENTI ST 14 120103 2023-07      CLEMENTI    5 ROOM   103
10 103 CLEMENTI ST 14 120103 2024-07      CLEMENTI    5 ROOM   103
      street_name storey_range floor_area_sqm flat_model lease_commence_date
1      SIMEI ST 1     01 TO 03            128   Improved                1988
2      SIMEI ST 1     01 TO 03            122   Improved                1988
3  TECK WHYE LANE     01 TO 03            121   Improved                1989
4  TECK WHYE LANE     01 TO 03            121   Improved                1989
5   PUNGGOL FIELD     01 TO 03            110   Improved                2002
6   PUNGGOL FIELD     16 TO 18            110   Improved                2002
7  CLEMENTI ST 14     10 TO 12            121   Improved                1984
8  CLEMENTI ST 14     07 TO 09            123   Improved                1984
9  CLEMENTI ST 14     04 TO 06            121   Improved                1984
10 CLEMENTI ST 14     13 TO 15            123   Improved                1984
      remaining_lease resale_price remaining_lease_yr remaining_lease_mth
1  64 years 03 months       680000                 64                   3
2   63 years 01 month       715000                 63                   1
3  64 years 11 months       520000                 64                  11
4  63 years 08 months       575000                 63                   8
5  77 years 04 months       603888                 77                   4
6  77 years 03 months       655000                 77                   3
7  60 years 08 months       808800                 60                   8
8  60 years 08 months       800000                 60                   8
9  60 years 05 months       758000                 60                   5
10 59 years 06 months       500000                 59                   6
                    geometry distance_to_eldercare distance_to_cbd
1   POINT (41144.95 35890.2)              758.1380        12790.41
2   POINT (41144.95 35890.2)              758.1380        12790.41
3  POINT (19216.72 39896.49)              327.6085        14923.59
4  POINT (19216.72 39896.49)              327.6085        14923.59
5  POINT (36064.38 42301.51)              701.7365        14076.30
6  POINT (36064.38 42301.51)              701.7365        14076.30
7  POINT (20861.59 33870.73)              378.3587        10101.36
8  POINT (20861.59 33870.73)              378.3587        10101.36
9  POINT (20861.59 33870.73)              378.3587        10101.36
10 POINT (20861.59 33870.73)              378.3587        10101.36
   distance_to_hawker distance_to_mall distance_to_mrt distance_to_parks
1            924.5228         271.0456       332.54769          256.4781
2            924.5228         271.0456       332.54769          256.4781
3           1333.7754         390.7374        85.37721          341.9659
4           1333.7754         390.7374        85.37721          341.9659
5           1040.3345         748.0958        36.02671          828.0045
6           1040.3345         748.0958        36.02671          828.0045
7            927.6112        1001.6526       903.86491          459.1670
8            927.6112        1001.6526       903.86491          459.1670
9            927.6112        1001.6526       903.86491          459.1670
10           927.6112        1001.6526       903.86491          459.1670

This code calculates the number of preschools and schools within a 400-meter radius of each HDB resale location. It first loads and transforms the preschool and school zone data into the same coordinate reference system (EPSG:3414) as the HDB data in joined_data. Then, it creates a 400-meter buffer around each HDB point to represent a walkable area. Using st_intersects, it identifies which preschools and schools fall within each buffer and counts them. These counts are added to joined_data as new columns, preschools_within_400m and schools_within_400m, to support analysis of nearby educational facilities for each HDB resale property.

preschools <- st_read("data/geospatial/PreSchoolsLocation.kml")
Reading layer `PRESCHOOLS_LOCATION' from data source 
  `C:\loriellemalveda\ISSS626-GAA\Take-home_Ex\Take-home_Ex03\data\geospatial\PreSchoolsLocation.kml' 
  using driver `KML'
Simple feature collection with 2290 features and 2 fields
Geometry type: POINT
Dimension:     XYZ
Bounding box:  xmin: 103.6878 ymin: 1.247759 xmax: 103.9897 ymax: 1.462134
z_range:       zmin: 0 zmax: 0
Geodetic CRS:  WGS 84
school_zones <- st_read(dsn = "data/geospatial/SchoolZone_Jan2024/SchoolZone_Jan2024", layer="SCHOOLZONE") 
Reading layer `SCHOOLZONE' from data source 
  `C:\loriellemalveda\ISSS626-GAA\Take-home_Ex\Take-home_Ex03\data\geospatial\SchoolZone_Jan2024\SchoolZone_Jan2024' 
  using driver `ESRI Shapefile'
Simple feature collection with 211 features and 3 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 11717.22 ymin: 28358.21 xmax: 42850.04 ymax: 48798.1
Projected CRS: SVY21
school_zones <- st_transform(school_zones, crs = 3414)
# Ensure both datasets (HDB points and preschools/schools) are in the same CRS
joined_data <- st_transform(joined_data, crs = 3414)  # Assuming HDB data is in joined_data
preschools <- st_transform(preschools, crs = 3414)    # Assuming preschools is an sf object
schools <- st_transform(school_zones, crs = 3414)          # Assuming schools is an sf object

# Create a 500m buffer around each HDB point
hdb_buffers <- st_buffer(joined_data, dist = 400)
# Step 2: Count preschools within each HDB buffer
# Use st_intersects to identify preschools within each buffer
preschools_within_buffer <- st_intersects(hdb_buffers, preschools)
joined_data$preschools_within_400m <- sapply(preschools_within_buffer, length)

# Step 3: Count schools within each HDB buffer
schools_within_buffer <- st_intersects(hdb_buffers, schools)
joined_data$schools_within_400m <- sapply(schools_within_buffer, length)

# View the updated dataset to check counts
print(joined_data)
Simple feature collection with 2052 features and 23 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 11755.72 ymin: 28457.97 xmax: 42441.24 ymax: 48339.17
Projected CRS: SVY21 / Singapore TM
First 10 features:
              address postal   month          town flat_type block
1      101 SIMEI ST 1 520101 2023-07      TAMPINES    5 ROOM   101
2      101 SIMEI ST 1 520101 2024-09      TAMPINES    5 ROOM   101
3  102 TECK WHYE LANE 680102 2023-06 CHOA CHU KANG    5 ROOM   102
4  102 TECK WHYE LANE 680102 2024-09 CHOA CHU KANG    5 ROOM   102
5  102A PUNGGOL FIELD 821102 2024-08       PUNGGOL    5 ROOM  102A
6  102A PUNGGOL FIELD 821102 2024-09       PUNGGOL    5 ROOM  102A
7  103 CLEMENTI ST 14 120103 2023-04      CLEMENTI    5 ROOM   103
8  103 CLEMENTI ST 14 120103 2023-04      CLEMENTI    5 ROOM   103
9  103 CLEMENTI ST 14 120103 2023-07      CLEMENTI    5 ROOM   103
10 103 CLEMENTI ST 14 120103 2024-07      CLEMENTI    5 ROOM   103
      street_name storey_range floor_area_sqm flat_model lease_commence_date
1      SIMEI ST 1     01 TO 03            128   Improved                1988
2      SIMEI ST 1     01 TO 03            122   Improved                1988
3  TECK WHYE LANE     01 TO 03            121   Improved                1989
4  TECK WHYE LANE     01 TO 03            121   Improved                1989
5   PUNGGOL FIELD     01 TO 03            110   Improved                2002
6   PUNGGOL FIELD     16 TO 18            110   Improved                2002
7  CLEMENTI ST 14     10 TO 12            121   Improved                1984
8  CLEMENTI ST 14     07 TO 09            123   Improved                1984
9  CLEMENTI ST 14     04 TO 06            121   Improved                1984
10 CLEMENTI ST 14     13 TO 15            123   Improved                1984
      remaining_lease resale_price remaining_lease_yr remaining_lease_mth
1  64 years 03 months       680000                 64                   3
2   63 years 01 month       715000                 63                   1
3  64 years 11 months       520000                 64                  11
4  63 years 08 months       575000                 63                   8
5  77 years 04 months       603888                 77                   4
6  77 years 03 months       655000                 77                   3
7  60 years 08 months       808800                 60                   8
8  60 years 08 months       800000                 60                   8
9  60 years 05 months       758000                 60                   5
10 59 years 06 months       500000                 59                   6
                    geometry distance_to_eldercare distance_to_cbd
1   POINT (41144.95 35890.2)              758.1380        12790.41
2   POINT (41144.95 35890.2)              758.1380        12790.41
3  POINT (19216.72 39896.49)              327.6085        14923.59
4  POINT (19216.72 39896.49)              327.6085        14923.59
5  POINT (36064.38 42301.51)              701.7365        14076.30
6  POINT (36064.38 42301.51)              701.7365        14076.30
7  POINT (20861.59 33870.73)              378.3587        10101.36
8  POINT (20861.59 33870.73)              378.3587        10101.36
9  POINT (20861.59 33870.73)              378.3587        10101.36
10 POINT (20861.59 33870.73)              378.3587        10101.36
   distance_to_hawker distance_to_mall distance_to_mrt distance_to_parks
1            924.5228         271.0456       332.54769          256.4781
2            924.5228         271.0456       332.54769          256.4781
3           1333.7754         390.7374        85.37721          341.9659
4           1333.7754         390.7374        85.37721          341.9659
5           1040.3345         748.0958        36.02671          828.0045
6           1040.3345         748.0958        36.02671          828.0045
7            927.6112        1001.6526       903.86491          459.1670
8            927.6112        1001.6526       903.86491          459.1670
9            927.6112        1001.6526       903.86491          459.1670
10           927.6112        1001.6526       903.86491          459.1670
   preschools_within_400m schools_within_400m
1                       5                   1
2                       5                   1
3                       7                   0
4                       7                   0
5                      10                   1
6                      10                   1
7                       3                   0
8                       3                   0
9                       3                   0
10                      3                   0
write_rds(joined_data, "data/rds/dataset_no_log_transformed.rds") 
dataset_no_log_transformed <- read_rds("data/rds/dataset_no_log_transformed.rds")

This code converts the distance variables in joined_data from meters to kilometers for easier interpretation. It divides each distance column—distance_to_eldercare, distance_to_cbd, distance_to_hawker, distance_to_mall, distance_to_mrt, and distance_to_parks—by 1,000, updating these values to represent distances in kilometers. This transformation simplifies analysis and comparison across distance-related metrics.

library(dplyr)

# Convert distance variables from meters to kilometers
joined_data <- joined_data %>%
  mutate(
    distance_to_eldercare = distance_to_eldercare / 1000,
    distance_to_cbd = distance_to_cbd / 1000,
    distance_to_hawker = distance_to_hawker / 1000,
    distance_to_mall = distance_to_mall / 1000,
    distance_to_mrt = distance_to_mrt / 1000,
    distance_to_parks = distance_to_parks / 1000
  )

DATASET VARIABLES

In this dataset, several derived variables have been created from the original data, representing factors that could be relevant to the predictive model:

  1. Total Remaining Lease (in months): A new variable, total_remaining_lease_mths, calculates the total remaining lease time by converting years and months into a single metric in months. This helps standardize the remaining lease duration for analysis.

  2. Years Since Lease Commencement: The years_since_lease_commence variable calculates the age of the lease from its commencement up to the year 2024, providing insight into the property’s age.

  3. Median Storey: For properties with a storey range, the median_storey variable estimates the average storey level by taking the midpoint between the minimum and maximum storeys in the range, giving a useful measure of building height.

These derived variables enhance the dataset by capturing relevant factors that can potentially impact property value in the predictive model.

library(dplyr)

# Add a new column for total remaining lease in months
joined_data <- joined_data %>%
  mutate(total_remaining_lease_mths = (remaining_lease_yr * 12) + remaining_lease_mth)

# View the updated dataset to confirm the new column
head(joined_data)
Simple feature collection with 6 features and 24 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 19216.72 ymin: 35890.2 xmax: 41144.95 ymax: 42301.51
Projected CRS: SVY21 / Singapore TM
             address postal   month          town flat_type block
1     101 SIMEI ST 1 520101 2023-07      TAMPINES    5 ROOM   101
2     101 SIMEI ST 1 520101 2024-09      TAMPINES    5 ROOM   101
3 102 TECK WHYE LANE 680102 2023-06 CHOA CHU KANG    5 ROOM   102
4 102 TECK WHYE LANE 680102 2024-09 CHOA CHU KANG    5 ROOM   102
5 102A PUNGGOL FIELD 821102 2024-08       PUNGGOL    5 ROOM  102A
6 102A PUNGGOL FIELD 821102 2024-09       PUNGGOL    5 ROOM  102A
     street_name storey_range floor_area_sqm flat_model lease_commence_date
1     SIMEI ST 1     01 TO 03            128   Improved                1988
2     SIMEI ST 1     01 TO 03            122   Improved                1988
3 TECK WHYE LANE     01 TO 03            121   Improved                1989
4 TECK WHYE LANE     01 TO 03            121   Improved                1989
5  PUNGGOL FIELD     01 TO 03            110   Improved                2002
6  PUNGGOL FIELD     16 TO 18            110   Improved                2002
     remaining_lease resale_price remaining_lease_yr remaining_lease_mth
1 64 years 03 months       680000                 64                   3
2  63 years 01 month       715000                 63                   1
3 64 years 11 months       520000                 64                  11
4 63 years 08 months       575000                 63                   8
5 77 years 04 months       603888                 77                   4
6 77 years 03 months       655000                 77                   3
                   geometry distance_to_eldercare distance_to_cbd
1  POINT (41144.95 35890.2)             0.7581380        12.79041
2  POINT (41144.95 35890.2)             0.7581380        12.79041
3 POINT (19216.72 39896.49)             0.3276085        14.92359
4 POINT (19216.72 39896.49)             0.3276085        14.92359
5 POINT (36064.38 42301.51)             0.7017365        14.07630
6 POINT (36064.38 42301.51)             0.7017365        14.07630
  distance_to_hawker distance_to_mall distance_to_mrt distance_to_parks
1          0.9245228        0.2710456      0.33254769         0.2564781
2          0.9245228        0.2710456      0.33254769         0.2564781
3          1.3337754        0.3907374      0.08537721         0.3419659
4          1.3337754        0.3907374      0.08537721         0.3419659
5          1.0403345        0.7480958      0.03602671         0.8280045
6          1.0403345        0.7480958      0.03602671         0.8280045
  preschools_within_400m schools_within_400m total_remaining_lease_mths
1                      5                   1                        771
2                      5                   1                        757
3                      7                   0                        779
4                      7                   0                        764
5                     10                   1                        928
6                     10                   1                        927
library(dplyr)

# Add a new column calculating the age of the lease based on 2024
joined_data <- joined_data %>%
  mutate(years_since_lease_commence = 2024 - lease_commence_date)

# View the updated dataset
head(joined_data)
Simple feature collection with 6 features and 25 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 19216.72 ymin: 35890.2 xmax: 41144.95 ymax: 42301.51
Projected CRS: SVY21 / Singapore TM
             address postal   month          town flat_type block
1     101 SIMEI ST 1 520101 2023-07      TAMPINES    5 ROOM   101
2     101 SIMEI ST 1 520101 2024-09      TAMPINES    5 ROOM   101
3 102 TECK WHYE LANE 680102 2023-06 CHOA CHU KANG    5 ROOM   102
4 102 TECK WHYE LANE 680102 2024-09 CHOA CHU KANG    5 ROOM   102
5 102A PUNGGOL FIELD 821102 2024-08       PUNGGOL    5 ROOM  102A
6 102A PUNGGOL FIELD 821102 2024-09       PUNGGOL    5 ROOM  102A
     street_name storey_range floor_area_sqm flat_model lease_commence_date
1     SIMEI ST 1     01 TO 03            128   Improved                1988
2     SIMEI ST 1     01 TO 03            122   Improved                1988
3 TECK WHYE LANE     01 TO 03            121   Improved                1989
4 TECK WHYE LANE     01 TO 03            121   Improved                1989
5  PUNGGOL FIELD     01 TO 03            110   Improved                2002
6  PUNGGOL FIELD     16 TO 18            110   Improved                2002
     remaining_lease resale_price remaining_lease_yr remaining_lease_mth
1 64 years 03 months       680000                 64                   3
2  63 years 01 month       715000                 63                   1
3 64 years 11 months       520000                 64                  11
4 63 years 08 months       575000                 63                   8
5 77 years 04 months       603888                 77                   4
6 77 years 03 months       655000                 77                   3
                   geometry distance_to_eldercare distance_to_cbd
1  POINT (41144.95 35890.2)             0.7581380        12.79041
2  POINT (41144.95 35890.2)             0.7581380        12.79041
3 POINT (19216.72 39896.49)             0.3276085        14.92359
4 POINT (19216.72 39896.49)             0.3276085        14.92359
5 POINT (36064.38 42301.51)             0.7017365        14.07630
6 POINT (36064.38 42301.51)             0.7017365        14.07630
  distance_to_hawker distance_to_mall distance_to_mrt distance_to_parks
1          0.9245228        0.2710456      0.33254769         0.2564781
2          0.9245228        0.2710456      0.33254769         0.2564781
3          1.3337754        0.3907374      0.08537721         0.3419659
4          1.3337754        0.3907374      0.08537721         0.3419659
5          1.0403345        0.7480958      0.03602671         0.8280045
6          1.0403345        0.7480958      0.03602671         0.8280045
  preschools_within_400m schools_within_400m total_remaining_lease_mths
1                      5                   1                        771
2                      5                   1                        757
3                      7                   0                        779
4                      7                   0                        764
5                     10                   1                        928
6                     10                   1                        927
  years_since_lease_commence
1                         36
2                         36
3                         35
4                         35
5                         22
6                         22
library(dplyr)
library(stringr)

# Calculate the median storey for each range
joined_data <- joined_data %>%
  mutate(
    # Extract the lower and upper limits of the storey range
    storey_min = as.numeric(str_extract(storey_range, "^\\d+")),
    storey_max = as.numeric(str_extract(storey_range, "\\d+$")),
    # Calculate the median storey as the average of min and max
    median_storey = (storey_min + storey_max) / 2
  )

# View the updated dataset
head(joined_data)
Simple feature collection with 6 features and 28 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 19216.72 ymin: 35890.2 xmax: 41144.95 ymax: 42301.51
Projected CRS: SVY21 / Singapore TM
             address postal   month          town flat_type block
1     101 SIMEI ST 1 520101 2023-07      TAMPINES    5 ROOM   101
2     101 SIMEI ST 1 520101 2024-09      TAMPINES    5 ROOM   101
3 102 TECK WHYE LANE 680102 2023-06 CHOA CHU KANG    5 ROOM   102
4 102 TECK WHYE LANE 680102 2024-09 CHOA CHU KANG    5 ROOM   102
5 102A PUNGGOL FIELD 821102 2024-08       PUNGGOL    5 ROOM  102A
6 102A PUNGGOL FIELD 821102 2024-09       PUNGGOL    5 ROOM  102A
     street_name storey_range floor_area_sqm flat_model lease_commence_date
1     SIMEI ST 1     01 TO 03            128   Improved                1988
2     SIMEI ST 1     01 TO 03            122   Improved                1988
3 TECK WHYE LANE     01 TO 03            121   Improved                1989
4 TECK WHYE LANE     01 TO 03            121   Improved                1989
5  PUNGGOL FIELD     01 TO 03            110   Improved                2002
6  PUNGGOL FIELD     16 TO 18            110   Improved                2002
     remaining_lease resale_price remaining_lease_yr remaining_lease_mth
1 64 years 03 months       680000                 64                   3
2  63 years 01 month       715000                 63                   1
3 64 years 11 months       520000                 64                  11
4 63 years 08 months       575000                 63                   8
5 77 years 04 months       603888                 77                   4
6 77 years 03 months       655000                 77                   3
                   geometry distance_to_eldercare distance_to_cbd
1  POINT (41144.95 35890.2)             0.7581380        12.79041
2  POINT (41144.95 35890.2)             0.7581380        12.79041
3 POINT (19216.72 39896.49)             0.3276085        14.92359
4 POINT (19216.72 39896.49)             0.3276085        14.92359
5 POINT (36064.38 42301.51)             0.7017365        14.07630
6 POINT (36064.38 42301.51)             0.7017365        14.07630
  distance_to_hawker distance_to_mall distance_to_mrt distance_to_parks
1          0.9245228        0.2710456      0.33254769         0.2564781
2          0.9245228        0.2710456      0.33254769         0.2564781
3          1.3337754        0.3907374      0.08537721         0.3419659
4          1.3337754        0.3907374      0.08537721         0.3419659
5          1.0403345        0.7480958      0.03602671         0.8280045
6          1.0403345        0.7480958      0.03602671         0.8280045
  preschools_within_400m schools_within_400m total_remaining_lease_mths
1                      5                   1                        771
2                      5                   1                        757
3                      7                   0                        779
4                      7                   0                        764
5                     10                   1                        928
6                     10                   1                        927
  years_since_lease_commence storey_min storey_max median_storey
1                         36          1          3             2
2                         36          1          3             2
3                         35          1          3             2
4                         35          1          3             2
5                         22          1          3             2
6                         22         16         18            17
library(ggplot2)

# Distance to nearest eldercare
ggplot(joined_data, aes(x = distance_to_eldercare)) +
  geom_density(fill = "skyblue", alpha = 0.5) +
  labs(title = "Density of Distance to Nearest Eldercare",
       x = "Distance to Eldercare (meters)", y = "Density")

# Distance to nearest CBD
ggplot(joined_data, aes(x = distance_to_cbd)) +
  geom_density(fill = "skyblue", alpha = 0.5) +
  labs(title = "Density of Distance to Nearest CBD",
       x = "Distance to CBD (meters)", y = "Density")

# Distance to nearest hawker center
ggplot(joined_data, aes(x = distance_to_hawker)) +
  geom_density(fill = "lightgreen", alpha = 0.5) +
  labs(title = "Density of Distance to Nearest Hawker",
       x = "Distance to Hawker (meters)", y = "Density")

# Distance to nearest mall
ggplot(joined_data, aes(x = distance_to_mall)) +
  geom_density(fill = "lightgreen", alpha = 0.5) +
  labs(title = "Density of Distance to Nearest Mall",
       x = "Distance to Mall (meters)", y = "Density")

# Distance to nearest MRT station
ggplot(joined_data, aes(x = distance_to_mrt)) +
  geom_density(fill = "coral", alpha = 0.5) +
  labs(title = "Density of Distance to Nearest MRT Station",
       x = "Distance to MRT Station (meters)", y = "Density")

# Distance to nearest park
ggplot(joined_data, aes(x = distance_to_parks)) +
  geom_density(fill = "coral", alpha = 0.5) +
  labs(title = "Density of Distance to Nearest Park",
       x = "Distance to Park (meters)", y = "Density")

# Number of preschools within 500m
ggplot(joined_data, aes(x = preschools_within_400m)) +
  geom_histogram(binwidth = 1, fill = "purple", color = "black") +
  labs(title = "Histogram of Preschools within 500m",
       x = "Number of Preschools within 500m", y = "Frequency")

# Number of schools within 500m
ggplot(joined_data, aes(x = schools_within_400m)) +
  geom_histogram(binwidth = 1, fill = "orange", color = "black") +
  labs(title = "Histogram of Schools within 500m",
       x = "Number of Schools within 500m", y = "Frequency")

ggplot(joined_data, aes(x = median_storey)) +
  geom_histogram(binwidth = 1, fill = "orange", color = "black") +
  labs(title = "Histogram of Median Storey",
       x = "Number of Schools within 500m", y = "Frequency")

ggplot(joined_data, aes(x = total_remaining_lease_mths)) +
  geom_histogram(binwidth = 1, fill = "orange", color = "black") +
  labs(title = "Histogram of Total Remaining Lease Months",
       x = "Number of Schools within 500m", y = "Frequency")

ggplot(joined_data, aes(x = years_since_lease_commence)) +
  geom_histogram(binwidth = 1, fill = "orange", color = "black") +
  labs(title = "Histogram of Years since lease commence",
       x = "Number of Schools within 500m", y = "Frequency")

ggplot(joined_data, aes(x = floor_area_sqm)) +
  geom_histogram(binwidth = 1, fill = "orange", color = "black") +
  labs(title = "Histogram of floor area sqm",
       x = "Number of Schools within 500m", y = "Frequency")

ggplot(joined_data, aes(x = resale_price)) +
  geom_density(binwidth = 1, fill = "orange", color = "black") +
  labs(title = "Histogram of resale price",
       x = "Number of Schools within 500m", y = "Frequency")

Since several variables did not appear to follow a normal distribution, a log transformation was applied to make them more suitable for analysis in the GWR model. Adding 1 to each variable prevents issues with zero values, ensuring valid log transformations. The resulting variables include log-transformed versions of distance measures, nearby amenity counts, median storey, remaining lease, years since lease commencement, and floor area. This transformation helps to stabilize variance, normalize distributions, and potentially enhance model interpretability and performance.

library(dplyr)

# Log-transform all distance and count variables, adding +1 to avoid log(0) issues
joined_data <- joined_data %>%
  mutate(
    log_distance_to_eldercare = log(distance_to_eldercare + 1),
    log_distance_to_cbd = log(distance_to_cbd + 1),
    log_distance_to_hawker = log(distance_to_hawker + 1),
    log_distance_to_mall = log(distance_to_mall + 1),
    log_distance_to_mrt = log(distance_to_mrt + 1),
    log_distance_to_parks = log(distance_to_parks + 1),
    log_preschools_within_400m = log(preschools_within_400m + 1),
    log_schools_within_400m = log(schools_within_400m + 1),
    log_median_storey = log(median_storey + 1),
    log_total_remaining_lease = log(total_remaining_lease_mths + 1),
    log_years_since_lease_commence = log(years_since_lease_commence + 1),
    log_floor_area_sqm = log(floor_area_sqm + 1)
    
  )
library(ggplot2)

# Density plots for log-transformed distance variables
ggplot(joined_data, aes(x = log_distance_to_eldercare)) +
  geom_density(fill = "skyblue", alpha = 0.5) +
  labs(title = "Log Density of Distance to Nearest Eldercare",
       x = "Log Distance to Eldercare", y = "Density")

ggplot(joined_data, aes(x = log_distance_to_cbd)) +
  geom_density(fill = "skyblue", alpha = 0.5) +
  labs(title = "Log Density of Distance to Nearest CBD",
       x = "Log Distance to CBD", y = "Density")

ggplot(joined_data, aes(x = log_distance_to_hawker)) +
  geom_density(fill = "lightgreen", alpha = 0.5) +
  labs(title = "Log Density of Distance to Nearest Hawker",
       x = "Log Distance to Hawker", y = "Density")

ggplot(joined_data, aes(x = log_distance_to_mall)) +
  geom_density(fill = "lightgreen", alpha = 0.5) +
  labs(title = "Log Density of Distance to Nearest Mall",
       x = "Log Distance to Mall", y = "Density")

ggplot(joined_data, aes(x = log_distance_to_mrt)) +
  geom_density(fill = "coral", alpha = 0.5) +
  labs(title = "Log Density of Distance to Nearest MRT Station",
       x = "Log Distance to MRT Station", y = "Density")

ggplot(joined_data, aes(x = log_distance_to_parks)) +
  geom_density(fill = "coral", alpha = 0.5) +
  labs(title = "Log Density of Distance to Nearest Park",
       x = "Log Distance to Park", y = "Density")

# Histograms for log-transformed count variables
ggplot(joined_data, aes(x = log_preschools_within_400m)) +
  geom_histogram(binwidth = 0.1, fill = "purple", color = "black") +
  labs(title = "Log Histogram of Preschools within 500m",
       x = "Log Number of Preschools within 500m", y = "Frequency")

ggplot(joined_data, aes(x = log_schools_within_400m)) +
  geom_histogram(binwidth = 0.1, fill = "orange", color = "black") +
  labs(title = "Log Histogram of Schools within 500m",
       x = "Log Number of Schools within 500m", y = "Frequency")

ggplot(joined_data, aes(x = log_median_storey)) +
  geom_histogram(binwidth = 1, fill = "orange", color = "black") +
  labs(title = "Histogram of Median Storey",
       x = "Number of Schools within 500m", y = "Frequency")

ggplot(joined_data, aes(x = log_total_remaining_lease)) +
  geom_density(binwidth = 1, fill = "orange", color = "black") +
  labs(title = "Histogram of Total Remaining Lease Months",
       x = "Number of Schools within 500m", y = "Frequency")

ggplot(joined_data, aes(x = log_years_since_lease_commence)) +
  geom_density(binwidth = 1, fill = "orange", color = "black") +
  labs(title = "Histogram of Years Since Lease Commence",
       x = "Number of Schools within 500m", y = "Frequency")

ggplot(joined_data, aes(x = log_floor_area_sqm)) +
  geom_density(binwidth = 1, fill = "orange", color = "black") +
  labs(title = "Histogram of Floor Area sqm",
       x = "Number of Schools within 500m", y = "Frequency")

library(dplyr)
library(sf)  # for st_drop_geometry
library(corrplot)

# Step 1: Drop the geometry column
numeric_data <- joined_data %>%
  st_drop_geometry() %>%  # Remove the geometry column
  select(
    log_distance_to_eldercare, log_distance_to_cbd, log_distance_to_hawker, 
    log_distance_to_mall, log_distance_to_mrt, log_distance_to_parks, 
    log_preschools_within_400m, log_schools_within_400m,
    log_median_storey,log_total_remaining_lease, resale_price, log_years_since_lease_commence,log_floor_area_sqm
  )

# Step 2: Compute the correlation matrix
cor_matrix <- cor(numeric_data, use = "complete.obs")  # Use "complete.obs" to handle missing data

# Print the correlation matrix
print(cor_matrix)
                               log_distance_to_eldercare log_distance_to_cbd
log_distance_to_eldercare                    1.000000000          0.30118662
log_distance_to_cbd                          0.301186622          1.00000000
log_distance_to_hawker                       0.111039508          0.34659052
log_distance_to_mall                         0.009091574         -0.17058707
log_distance_to_mrt                          0.078492235          0.03578689
log_distance_to_parks                       -0.017131482          0.37375049
log_preschools_within_400m                   0.023055243          0.11970527
log_schools_within_400m                      0.013676719          0.09870777
log_median_storey                           -0.094844163         -0.19613562
log_total_remaining_lease                    0.079828286          0.24744941
resale_price                                -0.235732333         -0.64274729
log_years_since_lease_commence              -0.029641406         -0.18218299
log_floor_area_sqm                          -0.061250887         -0.08959270
                               log_distance_to_hawker log_distance_to_mall
log_distance_to_eldercare                  0.11103951         0.0090915740
log_distance_to_cbd                        0.34659052        -0.1705870741
log_distance_to_hawker                     1.00000000        -0.1627287931
log_distance_to_mall                      -0.16272879         1.0000000000
log_distance_to_mrt                       -0.04772654         0.3227345062
log_distance_to_parks                      0.11737431        -0.0004525176
log_preschools_within_400m                 0.08556389        -0.2903878034
log_schools_within_400m                    0.01876348        -0.0989959691
log_median_storey                         -0.12826378        -0.0240286633
log_total_remaining_lease                  0.09655721        -0.1667559150
resale_price                              -0.31634902         0.0343916380
log_years_since_lease_commence            -0.04537941         0.1022365068
log_floor_area_sqm                         0.00454388         0.1757466027
                               log_distance_to_mrt log_distance_to_parks
log_distance_to_eldercare              0.078492235         -0.0171314819
log_distance_to_cbd                    0.035786885          0.3737504940
log_distance_to_hawker                -0.047726540          0.1173743134
log_distance_to_mall                   0.322734506         -0.0004525176
log_distance_to_mrt                    1.000000000          0.1100091533
log_distance_to_parks                  0.110009153          1.0000000000
log_preschools_within_400m            -0.214465568          0.0313474403
log_schools_within_400m               -0.217188092          0.0420952613
log_median_storey                     -0.057106989         -0.0733363997
log_total_remaining_lease             -0.005694969          0.1992314719
resale_price                          -0.069614079         -0.2118702786
log_years_since_lease_commence        -0.076149471         -0.1950705844
log_floor_area_sqm                     0.179682293         -0.0341265725
                               log_preschools_within_400m
log_distance_to_eldercare                     0.023055243
log_distance_to_cbd                           0.119705265
log_distance_to_hawker                        0.085563887
log_distance_to_mall                         -0.290387803
log_distance_to_mrt                          -0.214465568
log_distance_to_parks                         0.031347440
log_preschools_within_400m                    1.000000000
log_schools_within_400m                       0.279534178
log_median_storey                            -0.046616068
log_total_remaining_lease                     0.004644186
resale_price                                 -0.084728010
log_years_since_lease_commence                0.091914221
log_floor_area_sqm                           -0.017896573
                               log_schools_within_400m log_median_storey
log_distance_to_eldercare                   0.01367672       -0.09484416
log_distance_to_cbd                         0.09870777       -0.19613562
log_distance_to_hawker                      0.01876348       -0.12826378
log_distance_to_mall                       -0.09899597       -0.02402866
log_distance_to_mrt                        -0.21718809       -0.05710699
log_distance_to_parks                       0.04209526       -0.07333640
log_preschools_within_400m                  0.27953418       -0.04661607
log_schools_within_400m                     1.00000000       -0.01690392
log_median_storey                          -0.01690392        1.00000000
log_total_remaining_lease                  -0.25783108        0.11000803
resale_price                               -0.21226633        0.37665733
log_years_since_lease_commence              0.36913042       -0.10850119
log_floor_area_sqm                          0.06575264       -0.16608588
                               log_total_remaining_lease resale_price
log_distance_to_eldercare                    0.079828286  -0.23573233
log_distance_to_cbd                          0.247449415  -0.64274729
log_distance_to_hawker                       0.096557208  -0.31634902
log_distance_to_mall                        -0.166755915   0.03439164
log_distance_to_mrt                         -0.005694969  -0.06961408
log_distance_to_parks                        0.199231472  -0.21187028
log_preschools_within_400m                   0.004644186  -0.08472801
log_schools_within_400m                     -0.257831083  -0.21226633
log_median_storey                            0.110008033   0.37665733
log_total_remaining_lease                    1.000000000   0.22722493
resale_price                                 0.227224926   1.00000000
log_years_since_lease_commence              -0.940797181  -0.29619826
log_floor_area_sqm                          -0.624374748  -0.02192839
                               log_years_since_lease_commence
log_distance_to_eldercare                         -0.02964141
log_distance_to_cbd                               -0.18218299
log_distance_to_hawker                            -0.04537941
log_distance_to_mall                               0.10223651
log_distance_to_mrt                               -0.07614947
log_distance_to_parks                             -0.19507058
log_preschools_within_400m                         0.09191422
log_schools_within_400m                            0.36913042
log_median_storey                                 -0.10850119
log_total_remaining_lease                         -0.94079718
resale_price                                      -0.29619826
log_years_since_lease_commence                     1.00000000
log_floor_area_sqm                                 0.53335781
                               log_floor_area_sqm
log_distance_to_eldercare             -0.06125089
log_distance_to_cbd                   -0.08959270
log_distance_to_hawker                 0.00454388
log_distance_to_mall                   0.17574660
log_distance_to_mrt                    0.17968229
log_distance_to_parks                 -0.03412657
log_preschools_within_400m            -0.01789657
log_schools_within_400m                0.06575264
log_median_storey                     -0.16608588
log_total_remaining_lease             -0.62437475
resale_price                          -0.02192839
log_years_since_lease_commence         0.53335781
log_floor_area_sqm                     1.00000000
# Step 3: Visualize the correlation matrix (optional)
corrplot(cor_matrix, method = "color", type = "upper", 
         tl.col = "black", tl.srt = 45, title = "Correlation Matrix")

This code calculates and visualizes the correlation matrix for log-transformed and numeric variables in the joined_data dataset, excluding the spatial geometry. Missing values in numeric columns are imputed with the median to ensure completeness. The correlation matrix helps identify relationships among variables for further analysis.

In the correlation matrix, the variables log_total_remaining_lease and log_floor_area_sqm show a strong negative correlation. This suggests that as the total remaining lease (log-transformed) decreases, the floor area (log-transformed) tends to increase, or vice versa. This relationship could reflect a trend where older properties, which generally have less remaining lease, might also tend to have larger floor areas.

library(dplyr)

# Impute missing values in numerical columns with the median
joined_data <- joined_data %>%
  mutate(across(where(is.numeric), ~ ifelse(is.na(.), median(., na.rm = TRUE), .)))

The dataset is split into training and testing sets with an 80/20 proportion, where 80% of the data is used for training the model, and 20% is reserved for testing its performance. This ensures that the model is trained on a substantial portion of the data while retaining enough data for evaluating its predictive accuracy.

This linear regression model (price_mlr) aims to predict HDB resale prices using log-transformed distance variables and property characteristics as predictors. The model shows that several factors—such as distance to the CBD, MRT, parks, the number of nearby preschools and schools, and the median storey—are significant predictors of resale prices. With an adjusted R-squared of 0.6283, the model explains approximately 63% of the variance in resale prices, indicating a moderately strong fit.

set.seed(1234)
resale_split <- initial_split(joined_data, 
                              prop = 8/10,)
train_data <- training(resale_split)
test_data <- testing(resale_split)
# Correct model formula with + between variables
price_mlr <- lm(
  resale_price ~ log_distance_to_eldercare + log_distance_to_cbd + log_distance_to_hawker + log_distance_to_mall + log_distance_to_mrt + log_distance_to_parks + log_preschools_within_400m + log_schools_within_400m + log_median_storey + log_years_since_lease_commence,
  data = train_data
)

# Check the model summary
summary(price_mlr)

Call:
lm(formula = resale_price ~ log_distance_to_eldercare + log_distance_to_cbd + 
    log_distance_to_hawker + log_distance_to_mall + log_distance_to_mrt + 
    log_distance_to_parks + log_preschools_within_400m + log_schools_within_400m + 
    log_median_storey + log_years_since_lease_commence, data = train_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-283194  -62333  -10150   51438  423761 

Coefficients:
                               Estimate Std. Error t value Pr(>|t|)    
(Intercept)                     1613614      29354  54.971  < 2e-16 ***
log_distance_to_eldercare        -17138       8247  -2.078 0.037865 *  
log_distance_to_cbd             -258477       7718 -33.489  < 2e-16 ***
log_distance_to_hawker           -53686       9209  -5.830 6.68e-09 ***
log_distance_to_mall              -3401      13078  -0.260 0.794866    
log_distance_to_mrt              -41930      10881  -3.854 0.000121 ***
log_distance_to_parks            -17405      10667  -1.632 0.102950    
log_preschools_within_400m        12705       6231   2.039 0.041612 *  
log_schools_within_400m           -8768       5743  -1.527 0.127042    
log_median_storey                 48225       3974  12.135  < 2e-16 ***
log_years_since_lease_commence   -95171       3928 -24.228  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 92660 on 1630 degrees of freedom
Multiple R-squared:  0.642, Adjusted R-squared:  0.6399 
F-statistic: 292.4 on 10 and 1630 DF,  p-value: < 2.2e-16

This code uses the Variance Inflation Factor (VIF) to check for multicollinearity in the global regression model predicting resale_price. The VIF values for each predictor in global_model will help identify variables with high collinearity (typically indicated by VIF values greater than 5 or 10). High VIF values suggest that some predictors may be strongly correlated with others, which could impact the model’s stability and interpretability. Reducing multicollinearity may involve removing or combining highly correlated variables.

train_data_sp <- as_Spatial(train_data)
train_data_sp
class       : SpatialPointsDataFrame 
features    : 1641 
extent      : 11755.72, 42441.24, 28457.97, 48339.17  (xmin, xmax, ymin, ymax)
crs         : +proj=tmerc +lat_0=1.36666666666667 +lon_0=103.833333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs 
variables   : 40
names       :            address, postal,   month,       town, flat_type, block,   street_name, storey_range, floor_area_sqm, flat_model, lease_commence_date,    remaining_lease, resale_price, remaining_lease_yr, remaining_lease_mth, ... 
min values  :     101 SIMEI ST 1, 080109, 2023-01, ANG MO KIO,    5 ROOM,   101,  ADMIRALTY DR,     01 TO 03,            104,       3Gen,                1974, 48 years 11 months,       470000,                 48,                   1, ... 
max values  : 997C BUANGKOK CRES, 824662, 2024-09,     YISHUN,    5 ROOM,  997C, YUNG KUANG RD,     46 TO 48,            150,    Type S2,                2020, 95 years 05 months,      1588000,                 95,                  11, ... 
library(car)

# Run VIF on the global model to check for multicollinearity
global_model <- lm(resale_price ~ log_distance_to_eldercare + log_distance_to_cbd + 
                     log_distance_to_hawker + log_distance_to_mall + 
                     log_distance_to_mrt + log_distance_to_parks + 
                     log_preschools_within_400m + log_schools_within_400m + 
                     log_median_storey + log_years_since_lease_commence + 
                     log_floor_area_sqm, 
                   data = train_data_sp)

vif(global_model)
     log_distance_to_eldercare            log_distance_to_cbd 
                      1.153131                       1.574228 
        log_distance_to_hawker           log_distance_to_mall 
                      1.158551                       1.283349 
           log_distance_to_mrt          log_distance_to_parks 
                      1.287822                       1.240598 
    log_preschools_within_400m        log_schools_within_400m 
                      1.200550                       1.336858 
             log_median_storey log_years_since_lease_commence 
                      1.091242                       1.821248 
            log_floor_area_sqm 
                      1.582944 

Based on these VIF results, there is no indication of multicollinearity among the predictors, as all VIF values are below 2. This suggests that none of the variables are excessively correlated with each other, supporting the model’s stability and interpretability.

This code determines an optimal adaptive bandwidth (bw_adaptive) for a Geographically Weighted Regression (GWR) model that predicts resale_price using various log-transformed distance and property variables. Using cross-validation (approach = "CV") with a Gaussian kernel, the code adjusts the bandwidth based on local data density, which allows the model to account for spatial variation more accurately. The longlat = FALSE parameter specifies planar distance calculations, aligning with Singapore’s SVY21 coordinate system. This adaptive bandwidth helps capture local effects more effectively across different data densities within the region.

bw_adaptive <- bw.gwr(
  resale_price ~ log_distance_to_eldercare + log_distance_to_cbd + 
    log_distance_to_hawker + log_distance_to_mall + log_distance_to_mrt + 
    log_distance_to_parks + log_preschools_within_400m + 
    log_schools_within_400m + log_median_storey + 
    log_years_since_lease_commence + log_floor_area_sqm,
  data = train_data_sp,
  approach = "CV",
  kernel = "gaussian",
  adaptive = TRUE,
  longlat = FALSE
)
Take a cup of tea and have a break, it will take a few minutes.
          -----A kind suggestion from GWmodel development group
Adaptive bandwidth: 1021 CV score: 1.140465e+13 
Adaptive bandwidth: 639 CV score: 1.074045e+13 
Adaptive bandwidth: 401 CV score: 9.800083e+12 
Adaptive bandwidth: 256 CV score: 9.064662e+12 
Adaptive bandwidth: 164 CV score: 8.295959e+12 
Adaptive bandwidth: 109 CV score: 6.803495e+12 
Adaptive bandwidth: 73 CV score: 5.696787e+12 
Adaptive bandwidth: 53 CV score: 4.968459e+12 
Adaptive bandwidth: 38 CV score: 9.272527e+14 
Adaptive bandwidth: 59 CV score: 5.217215e+12 
Adaptive bandwidth: 45 CV score: 4.692108e+12 
Adaptive bandwidth: 44 CV score: 4.625041e+12 
Adaptive bandwidth: 39 CV score: 1.262828e+15 
Adaptive bandwidth: 42 CV score: 4.60572e+12 
Adaptive bandwidth: 46 CV score: 4.708241e+12 
Adaptive bandwidth: 44 CV score: 4.625041e+12 
Adaptive bandwidth: 45 CV score: 4.692108e+12 
Adaptive bandwidth: 44 CV score: 4.625041e+12 
Adaptive bandwidth: 44 CV score: 4.625041e+12 
Adaptive bandwidth: 43 CV score: 4.604421e+12 
Adaptive bandwidth: 41 CV score: 7.172298e+14 
Adaptive bandwidth: 41 CV score: 7.172298e+14 
Adaptive bandwidth: 40 CV score: 7.171396e+14 
Adaptive bandwidth: 40 CV score: 7.171396e+14 
Adaptive bandwidth: 39 CV score: 1.262828e+15 
Adaptive bandwidth: 39 CV score: 1.262828e+15 
Adaptive bandwidth: 38 CV score: 9.272527e+14 
Adaptive bandwidth: 38 CV score: 9.272527e+14 
Adaptive bandwidth: 37 CV score: 9.272069e+14 
Adaptive bandwidth: 37 CV score: 9.272069e+14 
Adaptive bandwidth: 36 CV score: 9.394113e+14 
Adaptive bandwidth: 36 CV score: 9.394113e+14 
Adaptive bandwidth: 35 CV score: 9.393867e+14 
Adaptive bandwidth: 35 CV score: 9.393867e+14 
Adaptive bandwidth: 34 CV score: 1.025499e+15 
Adaptive bandwidth: 34 CV score: 1.025499e+15 

This code runs an adaptive Geographically Weighted Regression (GWR) model (gwr_adaptive) to predict resale_price using various log-transformed distance and property characteristics. The model applies the optimal adaptive bandwidth (bw_adaptive) calculated previously, allowing it to vary based on data density. A Gaussian kernel is used to assign weights, and adaptive=TRUE enables the model to capture local spatial variations more accurately. With longlat = FALSE, distances are calculated in planar units, suitable for the SVY21 projection system. This GWR model helps reveal how the influence of each variable changes across different locations.

gwr_adaptive <- gwr.basic(formula = resale_price ~log_distance_to_eldercare + log_distance_to_cbd + log_distance_to_hawker + log_distance_to_mall + log_distance_to_mrt + log_distance_to_parks + log_preschools_within_400m + log_schools_within_400m + log_median_storey + log_years_since_lease_commence + log_floor_area_sqm,
                          data=train_data_sp,
                          bw=bw_adaptive, 
                          kernel = 'gaussian', 
                          adaptive=TRUE,
                          longlat = FALSE)

This output summarizes the results of the Geographically Weighted Regression (GWR) model for predicting HDB resale prices.

  1. Global Regression Results: The global model shows an adjusted R-squared of 0.6958, explaining around 70% of the variance in resale prices with significant predictors including distances to CBD, MRT stations, parks, nearby amenities, and property characteristics like storey level and lease age.

  2. GWR Model Performance: The GWR model, with an adaptive bandwidth that uses 34 nearest neighbors, demonstrates improved performance with an R-squared of 0.9374, indicating that the model explains around 94% of the variance in resale prices when accounting for spatial variation. The diagnostic metrics (AICc, AIC, and BIC) suggest a better model fit compared to the global regression.

  3. Spatial Variation in Coefficients: The GWR coefficients vary across locations, as shown in the summary with ranges for each predictor. This variation reflects the differing influence of factors like proximity to amenities and lease duration across different areas, highlighting the model’s ability to capture local effects.

gwr_adaptive
   ***********************************************************************
   *                       Package   GWmodel                             *
   ***********************************************************************
   Program starts at: 2024-11-09 19:49:51.140074 
   Call:
   gwr.basic(formula = resale_price ~ log_distance_to_eldercare + 
    log_distance_to_cbd + log_distance_to_hawker + log_distance_to_mall + 
    log_distance_to_mrt + log_distance_to_parks + log_preschools_within_400m + 
    log_schools_within_400m + log_median_storey + log_years_since_lease_commence + 
    log_floor_area_sqm, data = train_data_sp, bw = bw_adaptive, 
    kernel = "gaussian", adaptive = TRUE, longlat = FALSE)

   Dependent (y) variable:  resale_price
   Independent variables:  log_distance_to_eldercare log_distance_to_cbd log_distance_to_hawker log_distance_to_mall log_distance_to_mrt log_distance_to_parks log_preschools_within_400m log_schools_within_400m log_median_storey log_years_since_lease_commence log_floor_area_sqm
   Number of data points: 1641
   ***********************************************************************
   *                    Results of Global Regression                     *
   ***********************************************************************

   Call:
    lm(formula = formula, data = data)

   Residuals:
    Min      1Q  Median      3Q     Max 
-246438  -53643   -5081   41269  357502 

   Coefficients:
                                  Estimate Std. Error t value Pr(>|t|)    
   (Intercept)                    -2361196     227104 -10.397  < 2e-16 ***
   log_distance_to_eldercare         -3085       7602  -0.406  0.68490    
   log_distance_to_cbd             -257779       7075 -36.433  < 2e-16 ***
   log_distance_to_hawker           -60918       8452  -7.208 8.67e-13 ***
   log_distance_to_mall             -14074      12004  -1.172  0.24117    
   log_distance_to_mrt              -79198      10196  -7.768 1.41e-14 ***
   log_distance_to_parks            -27029       9794  -2.760  0.00585 ** 
   log_preschools_within_400m        11003       5713   1.926  0.05429 .  
   log_schools_within_400m            4203       5316   0.791  0.42928    
   log_median_storey                 55758       3668  15.201  < 2e-16 ***
   log_years_since_lease_commence  -136177       4287 -31.764  < 2e-16 ***
   log_floor_area_sqm               859580      48767  17.626  < 2e-16 ***

   ---Significance stars
   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
   Residual standard error: 84940 on 1629 degrees of freedom
   Multiple R-squared: 0.6994
   Adjusted R-squared: 0.6974 
   F-statistic: 344.5 on 11 and 1629 DF,  p-value: < 2.2e-16 
   ***Extra Diagnostic information
   Residual sum of squares: 1.175429e+13
   Sigma(hat): 84685.42
   AIC:  41920.82
   AICc:  41921.04
   BIC:  40446.3
   ***********************************************************************
   *          Results of Geographically Weighted Regression              *
   ***********************************************************************

   *********************Model calibration information*********************
   Kernel function: gaussian 
   Adaptive bandwidth: 43 (number of nearest neighbours)
   Regression points: the same locations as observations are used.
   Distance metric: Euclidean distance metric is used.

   ****************Summary of GWR coefficient estimates:******************
                                         Min.     1st Qu.      Median
   Intercept                      -11436554.2  -3064686.5  -1217942.1
   log_distance_to_eldercare        -568010.3    -62367.5     -9860.7
   log_distance_to_cbd             -9438534.0   -506318.3   -206796.2
   log_distance_to_hawker           -240639.1    -99748.1    -44161.9
   log_distance_to_mall             -475424.0   -165282.9    -80837.9
   log_distance_to_mrt             -1335909.3   -149289.1   -100247.9
   log_distance_to_parks            -453622.8    -52820.3       517.6
   log_preschools_within_400m        -28983.4      5749.9     18729.3
   log_schools_within_400m          -112888.8    -25465.9     -9810.8
   log_median_storey                  28330.2     38574.1     47551.7
   log_years_since_lease_commence   -396205.3   -171516.1   -118938.1
   log_floor_area_sqm              -2247797.6    339575.5    578504.2
                                      3rd Qu.     Max.
   Intercept                         534245.7 22477664
   log_distance_to_eldercare          38130.6   394258
   log_distance_to_cbd               -33750.5  3787337
   log_distance_to_hawker              5332.0   443729
   log_distance_to_mall              -11268.9  1137156
   log_distance_to_mrt               -46246.5   233541
   log_distance_to_parks              90287.4  1269492
   log_preschools_within_400m         37297.2    96400
   log_schools_within_400m             9748.4   673855
   log_median_storey                  59361.7    92068
   log_years_since_lease_commence    -85226.8    43027
   log_floor_area_sqm                963719.8  2099297
   ************************Diagnostic information*************************
   Number of data points: 1641 
   Effective number of parameters (2trace(S) - trace(S'S)): 244.048 
   Effective degrees of freedom (n-2trace(S) + trace(S'S)): 1396.952 
   AICc (GWR book, Fotheringham, et al. 2002, p. 61, eq 2.33): 40365.14 
   AIC (GWR book, Fotheringham, et al. 2002,GWR p. 96, eq. 4.22): 40106.68 
   BIC (GWR book, Fotheringham, et al. 2002,GWR p. 61, eq. 2.34): 39746.41 
   Residual sum of squares: 3.499648e+12 
   R-square value:  0.9104959 
   Adjusted R-square value:  0.8948483 

   ***********************************************************************
   Program stops at: 2024-11-09 19:49:52.105098 
test_data_sp <- test_data %>%
  as_Spatial()
test_data_sp
class       : SpatialPointsDataFrame 
features    : 411 
extent      : 11755.72, 42441.24, 28501.49, 48154.92  (xmin, xmax, ymin, ymax)
crs         : +proj=tmerc +lat_0=1.36666666666667 +lon_0=103.833333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs 
variables   : 40
names       :            address, postal,   month,       town, flat_type, block,     street_name, storey_range, floor_area_sqm, flat_model, lease_commence_date,    remaining_lease, resale_price, remaining_lease_yr, remaining_lease_mth, ... 
min values  :     101 SIMEI ST 1, 080109, 2023-01, ANG MO KIO,    5 ROOM,   101, ANCHORVALE CRES,     01 TO 03,            104,       3Gen,                1975, 50 years 07 months,       488000,                 50,                   1, ... 
max values  : 997C BUANGKOK CRES, 824662, 2024-09,     YISHUN,    5 ROOM,  997C,   YUNG KUANG RD,     46 TO 48,            139,    Type S2,                2020,  95 years 01 month,      1450000,                 95,                  11, ... 

This code performs Geographically Weighted Regression (GWR) predictions on a test dataset using an adaptive bandwidth.

  1. Bandwidth Selection for Test Data: gwr_bw_test_adaptive is calculated using cross-validation on the test_data_sp dataset to identify an optimal adaptive bandwidth. This ensures that the GWR model adapts to local data densities within the test dataset.

  2. GWR Prediction: The gwr.predict function applies the GWR model, using the previously computed bw_adaptive from the training data, and makes predictions on test_data_sp. The model uses the same Gaussian kernel and adaptive bandwidth, allowing for spatially varying predictions that account for the influence of local variables.

This setup allows the model to generate spatially aware predictions on unseen data, leveraging the training data’s local patterns to estimate resale prices in the test dataset.

gwr_bw_test_adaptive <- bw.gwr(resale_price ~ log_distance_to_eldercare + log_distance_to_cbd + log_distance_to_hawker + log_distance_to_mall + log_distance_to_mrt + log_distance_to_parks + log_preschools_within_400m + log_schools_within_400m + log_median_storey + log_years_since_lease_commence + log_floor_area_sqm,
                  data=test_data_sp,
                  approach="CV",
                  kernel="gaussian",
                  adaptive=TRUE,
                  longlat=FALSE)
Adaptive bandwidth: 261 CV score: 2.708293e+12 
Adaptive bandwidth: 169 CV score: 2.624083e+12 
Adaptive bandwidth: 111 CV score: 2.506149e+12 
Adaptive bandwidth: 76 CV score: 2.362042e+12 
Adaptive bandwidth: 53 CV score: 2.209377e+12 
Adaptive bandwidth: 40 CV score: 2.110797e+12 
Adaptive bandwidth: 31 CV score: 1.984621e+12 
Adaptive bandwidth: 26 CV score: 1.845974e+12 
Adaptive bandwidth: 22 CV score: 1.722423e+12 
Adaptive bandwidth: 20 CV score: 1.663859e+12 
Adaptive bandwidth: 18 CV score: 1.64053e+12 
Adaptive bandwidth: 18 CV score: 1.64053e+12 
gwr_pred <- gwr.predict(
  formula = resale_price ~ log_distance_to_eldercare + log_distance_to_cbd + 
             log_distance_to_hawker + log_distance_to_mall + log_distance_to_mrt + 
             log_distance_to_parks + log_preschools_within_400m + 
             log_schools_within_400m + log_median_storey + 
             log_years_since_lease_commence + log_floor_area_sqm, 
  data = train_data_sp, 
  predictdata = test_data_sp, 
  bw = bw_adaptive,
  kernel = "gaussian", 
  adaptive = TRUE,  
  longlat = FALSE
)
coords <- st_coordinates(joined_data)
coords_train <- st_coordinates(train_data)
coords_test <- st_coordinates(test_data)
train_data <- train_data %>% 
  st_drop_geometry()
test_data <- cbind(test_data, coords_test) %>%
  st_drop_geometry()
head(gwr_pred$SDF@data)
  Intercept_coef log_distance_to_eldercare_coef log_distance_to_cbd_coef
1      -687237.8                       39792.38                -29427.61
2     -1801476.0                      118135.27               -276284.05
3     -1731721.5                      -26067.61               -320298.78
4     -1731721.5                      -26067.61               -320298.78
5       264182.9                       32503.65               -129791.08
6     -3157231.8                      -86252.99                221432.28
  log_distance_to_hawker_coef log_distance_to_mall_coef
1                   -77805.69                 -62840.16
2                   -56186.52                -125819.06
3                   -70223.88                 -33314.58
4                   -70223.88                 -33314.58
5                   -40456.81                 -91071.18
6                  -207260.01                -164763.48
  log_distance_to_mrt_coef log_distance_to_parks_coef
1              -182929.483                  -68880.62
2                -9570.946                 -102371.54
3              -140599.491                  -30990.31
4              -140599.491                  -30990.31
5              -182282.183                  -50060.30
6               -95812.948                   97958.95
  log_preschools_within_400m_coef log_schools_within_400m_coef
1                        33679.05                   -53124.014
2                        32803.40                   -21317.726
3                       -20727.88                     2265.544
4                       -20727.88                     2265.544
5                        34923.72                   -50718.916
6                        37297.22                   -25465.899
  log_median_storey_coef log_years_since_lease_commence_coef
1               60244.01                          -167736.15
2               59135.65                           -91733.91
3               69387.92                          -153224.61
4               69387.92                          -153224.61
5               62339.57                          -145775.69
6               37689.31                          -137621.84
  log_floor_area_sqm_coef prediction prediction_var
1                436231.9   727775.1     2880857279
2                712258.1   573442.2     2939356035
3                798510.9   680130.2     2598057240
4                798510.9   743709.7     2605157674
5                274358.9   599134.0     2795978724
6                781639.8   585460.8     2620203481
# Extract coefficients and test data as data frames
coefficients <- gwr_pred$SDF@data
test_data_df <- as.data.frame(test_data_sp)

# Calculate the predicted values manually
predicted_values <- coefficients$Intercept_coef +
  coefficients$log_distance_to_eldercare_coef * test_data_df$log_distance_to_eldercare +
  coefficients$log_distance_to_cbd_coef * test_data_df$log_distance_to_cbd +
  coefficients$log_distance_to_hawker_coef * test_data_df$log_distance_to_hawker +
  coefficients$log_distance_to_mall_coef * test_data_df$log_distance_to_mall +
  coefficients$log_distance_to_mrt_coef * test_data_df$log_distance_to_mrt +
  coefficients$log_distance_to_parks_coef * test_data_df$log_distance_to_parks +
  coefficients$log_preschools_within_400m_coef * test_data_df$log_preschools_within_400m +
  coefficients$log_schools_within_400m_coef * test_data_df$log_schools_within_400m +
  coefficients$log_median_storey_coef * test_data_df$log_median_storey +
  coefficients$log_years_since_lease_commence_coef * test_data_df$log_years_since_lease_commence +
  coefficients$log_floor_area_sqm_coef * test_data_df$log_floor_area_sqm

# Create a comparison data frame
comparison_data <- data.frame(
  Actual = test_data_sp$resale_price,
  Predicted = predicted_values
)

# Plot the actual vs. predicted values
library(ggplot2)
ggplot(comparison_data, aes(x = Predicted, y = Actual)) +
  geom_point(color = "black", alpha = 0.6) +
  labs(title = "Actual vs Predicted Resale Prices",
       x = "Predicted Resale Price",
       y = "Actual Resale Price") +
  theme_minimal()

library(ggplot2)

# Modify the x-axis limit, for example, setting it to a maximum of 1 million
ggplot(comparison_data, aes(x = Predicted, y = Actual)) +
  geom_point(color = "black", alpha = 0.6) +
  labs(title = "Actual vs Predicted Resale Prices",
       x = "Predicted Resale Price",
       y = "Actual Resale Price") +
  xlim(0, 1e6) +  # Set the x-axis limit to a maximum of 1 million
  theme_minimal()